8

I am looking for a machine readable data set (e.g., a mere txt file with one word per row) containing the X most common words in English, excluding proper nouns (unlike this list). If possible, ordered by frequency.

Franck Dernoncourt
  • 1
    You can get the top ten hundred words from http://xkcd.com/simplewriter/words.js ...I thought it had duplicates (I'm, I'd, I'll), but it looks like he has all contractions w/ two different types of apostrophes. – Joe Mar 11 '16 at 14:54
  • 1
    Because English, unlike German, capitalizes only proper nouns rather than all nouns, it's fairly easy to do this with a few lines of code. This example does it for characters, but the switch to "words that don't start with a capital letter, except I" is easy enough: http://opendata.stackexchange.com/a/7043/1511 – philshem Mar 12 '16 at 07:24
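
The capitalization filter described in the last comment could look roughly like this in Python. This is only a sketch, not the code from the linked answer: the function name `drop_capitalized` and the sample words are made up for illustration, and it assumes the input list preserves the original capitalization (many frequency lists are already lowercased, in which case this filter cannot tell proper nouns apart).

```python
def drop_capitalized(words):
    """Keep words that do not start with an uppercase letter.

    The pronoun "I" and its contractions (with either a straight or a
    curly apostrophe) are kept as an exception.
    """
    kept = []
    for word in words:
        if not word:
            continue  # skip empty entries
        is_pronoun_i = word == "I" or word.startswith(("I'", "I\u2019"))
        if is_pronoun_i or not word[0].isupper():
            kept.append(word)
    return kept


# Hypothetical sample input, for illustration only.
sample = ["the", "London", "I", "of", "Monday", "I'm", "and"]
print(drop_capitalized(sample))  # ['the', 'I', 'of', "I'm", 'and']
```

Note that a filter like this also drops ordinary words that happen to be capitalized in the source (for example at the start of a sentence), so it works best on a list that was built from case-preserving counts.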

5 Answers

6

It's not entirely open, but the best I've found in this vein is www.wordfrequency.info. I am pretty sure that the 5000-word list is free, with fees for larger lists.

Joe Germuska
3

There are word frequency lists for a number of languages, including English, that were created by Hermit Dave from the data at opensubtitles.org.

You can download them for free on his blog. The licence is Creative Commons Attribution-ShareAlike 3.0.

Suzana
2

You can also check out this list at http://www.wordgamedictionary.com/word-lists/common-english-words/

Cesar Bielich
2

If you need more accurate and representative frequency counts of American English, check out the Corpus of Contemporary American English, which offers word lists of two different lengths (60k and 100k) that include part of speech and frequency distribution across spoken and written registers.

0

See resources at Wiktionary:Frequency Lists

In particular, I have used the Leipzig Corpora Collection. Use cut -f2 if you just want the words and not the rank or frequency.
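
If you would rather do the same extraction in Python than with cut, something like the following should work. It is a minimal sketch that assumes the file is tab-separated with the word in the second column, which is what cut -f2 relies on; the function name `second_field` and the sample rows are invented for illustration and are not real corpus data.

```python
def second_field(lines):
    """Return the second tab-separated field of each line, like `cut -f2`.

    Unlike cut, lines that contain no tab are skipped instead of being
    passed through unchanged.
    """
    words = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1:
            words.append(fields[1])
    return words


# Hypothetical rows in a rank<TAB>word<TAB>frequency layout.
sample = ["1\tthe\t1000\n", "2\tof\t600\n", "3\tand\t500\n"]
print(second_field(sample))  # ['the', 'of', 'and']
```

An open file object can be passed in directly, since iterating over it yields lines. Slicing the result, for example `second_field(f)[:1000]`, gives the first X words, provided the file is already sorted by frequency.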

qwr