180

I want to check in a Python program if a word is in the English dictionary.

I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task.

def is_english_word(word):
    pass # how do I implement is_english_word?

is_english_word(token.lower())

In the future, I might want to check if the singular form of a word is in the dictionary (e.g., properties -> property -> English word). How would I achieve that?

Salvador Dali
  • 199,541
  • 138
  • 677
  • 738
Barthelemy
  • 7,827
  • 6
  • 31
  • 35
  • you can see this page : https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language I recommend the `langid` – Mahdi Ebi Oct 12 '21 at 19:42

12 Answers

264

For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There's a tutorial, or you could just dive straight in:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>

PyEnchant comes with a few dictionaries (en_GB, en_US, de_DE, fr_FR), but can use any of the OpenOffice ones if you want more languages.

There appears to be a pluralisation library called inflect, but I've no idea whether it's any good.
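On the singularisation point from the question, a minimal hand-rolled sketch follows for illustration only; the function names are invented here, and English pluralisation has enough irregulars (mice, geese, criteria…) that a real project should prefer a library such as inflect, whose `singular_noun("properties")` returns `"property"`.

```python
def naive_singular(word):
    """Very rough singulariser -- handles only the common regular patterns."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"      # properties -> property
    if word.endswith("es") and word[-3] in "sxzh":
        return word[:-2]            # boxes -> box, dishes -> dish
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]            # cats -> cat (but glass stays glass)
    return word                     # already singular, or an irregular

def is_english_word_or_singular(word, check):
    # `check` is any membership test, e.g. d.check from the enchant answer
    return check(word) or check(naive_singular(word))
```

The wrapper lets you keep whatever dictionary backend you chose and only adds the fallback lookup on the guessed singular form.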

Kaushik Acharya
  • 1,308
  • 2
  • 14
  • 24
Katriel
  • 114,760
  • 19
  • 131
  • 163
  • 2
    Thank you, I did not know about PyEnchant and it is indeed much more useful for the kind of checks I want to make. – Barthelemy Sep 24 '10 at 16:52
  • It doesn't recognize ? Not a common word, but I know as an abbreviation for , and I do not know . Just wanted to point out that the solution isn't one-size-fits-all and that a different project might require different dictionaries or a different approach altogether. – dmh Apr 22 '12 at 18:02
  • Well, if you want a different dictionary you can always plug one in the back of PyEnchant! Note BTW that even the OED only lists "helo" as obsolete... – Katriel Apr 23 '12 at 19:19
  • How can one use Openoffice languages? – Palash Kumar Apr 14 '14 at 09:12
  • enchant doesn't recognize words like american, chinese, indian, and countries' names – Alok Nayak Nov 28 '15 at 11:15
  • This is not exactly what the OP has asked for, though. Enchant is a spellchecker. There are many strings a spellchecker must accept but a dictionary should not include, such as "et" (because it appears in "et al"), "situ" (because it appears in "in situ"), "hominem", etc. Also note that if it is precisely dictionary forms that you are interested in, forms such as "gives" should return False too. – Roozbehan May 23 '16 at 05:36
  • 31
    Package is basically impossible to install for me. Super frustrating. – Monica Heddneck May 25 '17 at 00:43
  • 10
    Enchant is not supported at this time for python 64bit on windows :( https://github.com/rfk/pyenchant/issues/42 – Ricky Boyce Jul 05 '17 at 00:23
  • The "tutorial" link is broken. – Rémy Jan 15 '19 at 09:16
  • 11
    [pyenchant](https://github.com/rfk/pyenchant) is no longer maintained. [pyhunspell](https://github.com/blatinier/pyhunspell) has more recent activity. Also `/usr/share/dict/` and `/var/lib/dict` may be referenced on *nix setups. – pkfm Mar 24 '19 at 01:38
  • 2
    [pyenchant](https://github.com/pyenchant/pyenchant) has apparently picked up a maintainer (August 2021). – Anaksunaman Aug 09 '21 at 18:56
71

It won't work well with WordNet, because WordNet does not contain all English words. Another possibility based on NLTK, without enchant, is NLTK's words corpus:

>>> from nltk.corpus import words
>>> "would" in words.words()
True
>>> "could" in words.words()
True
>>> "should" in words.words()
True
>>> "I" in words.words()
True
>>> "you" in words.words()
True
Sadik
  • 4,085
  • 7
  • 50
  • 87
  • 10
    The same mention applies here too: a lot faster when converted to a set: `set(words.words())` – Iulius Curt Sep 30 '14 at 19:41
  • 1
    watch out as you need to singularise words to get proper results – famargar Sep 06 '18 at 10:44
  • 3
    caution : words like pasta or burger are not found in this list – Paroksh Saxena Jan 10 '19 at 11:37
  • 1
    Actually, no library can cover all English words. Moreover, `words.words()` from `nltk` includes proper nouns (e.g. Abraham) as English words, but they can occur in any language, especially if the foreign text is transliterated to English. – hafiz031 Feb 17 '21 at 05:29
  • to be able to use `words` you first need to install it - `import nltk` and then `nltk.download('words')` – SubMachine Jan 21 '22 at 10:59
52

Using NLTK:

from nltk.corpus import wordnet

if not wordnet.synsets(word_to_test):
    pass  # not an English word
else:
    pass  # English word

You should refer to this article if you have trouble installing wordnet or want to try other approaches.

nickb
  • 58,150
  • 12
  • 100
  • 138
Susheel Javadi
  • 2,884
  • 3
  • 31
  • 34
42

Using a set to store the word list, because looking words up in a set is faster:

with open("english_words.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

def is_english_word(word):
    return word.lower() in english_words

print(is_english_word("ham"))  # should be True if you have a good english_words.txt

To answer the second part of the question, the plurals would already be in a good word list, but if you wanted to specifically exclude those from the list for some reason, you could indeed write a function to handle it. But English pluralization rules are tricky enough that I'd just include the plurals in the word list to begin with.

As to where to find English word lists, I found several just by Googling "English word list". Here is one: http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt You could Google for British or American English if you want specifically one of those dialects.

kindall
  • 168,929
  • 32
  • 262
  • 294
  • 9
    If you make `english_words` a `set` instead of a `list`, then `is_english_word` will run a lot faster. – dan04 Sep 24 '10 at 16:14
  • I actually just redid it as a dict but you're right, a set is even better. Updated. – kindall Sep 24 '10 at 16:16
  • 1
    You can also ditch `.xreadlines()` and just iterate over `word_file`. – FogleBird Sep 24 '10 at 16:18
  • Thanks for your answer. The reason why I wanted to use wordnet was because I could not find any standard/obvious list of English words including plural. Where would I find such files (with plural included)? – Barthelemy Sep 24 '10 at 16:27
  • 3
    Under ubuntu the packages `wamerican` and `wbritish` provide American and British English word lists as `/usr/share/dict/*-english`. The package info gives http://wordlist.sourceforge.net as a reference. – intuited Sep 24 '10 at 16:45
  • +1 I've tried all the answers here and this one is by far the easiest, fastest, lightest and most reliable way described. Plus if you have a special vocabulary or some slang you want to include you can just add it to your list. – Ryan Epp Dec 28 '13 at 15:56
  • 1
    I find a [GitHub repository](https://github.com/dwyl/english-words) which contains 479k English words. – haolee May 29 '17 at 02:59
8

For a faster NLTK-based solution you could hash the set of words to avoid a linear search.

from nltk.corpus import words as nltk_words

# Build this dictionary once, outside the function --
# it is expensive and you only need to do it once.
dictionary = dict.fromkeys(nltk_words.words(), None)

def is_english_word(word):
    try:
        dictionary[word]  # O(1) hash lookup; raises KeyError if absent
        return True
    except KeyError:
        return False
Guillaume Jacquenot
  • 10,118
  • 5
  • 41
  • 48
Eb Abadi
  • 455
  • 4
  • 15
8

For All Linux/Unix Users

If your OS uses the Linux kernel, there is a simple way to get all the words from the English/American dictionary. In the directory /usr/share/dict you have a words file. There are also more specific american-english and british-english files. These contain all of the words in that specific language. You can access this from any programming language, which is why I thought you might want to know about it.

Now, for Python users specifically, the code below builds a set containing every word:

import re

with open("/usr/share/dict/words") as word_file:
    # strip out any non-word characters, then split on whitespace
    words = set(re.sub(r"[^\w]", " ", word_file.read()).split())

def is_word(word):
    return word.lower() in words

is_word("tarts")  ## Returns True
is_word("jwiefjiojrfiorj")  ## Returns False

Hope this helps!

Linux4Life531
  • 359
  • 5
  • 12
  • 3
    This is a great answer as it avoids having to install a massive NLP library for this simple task. Only comment is in your example you leave the file open - doing this in a `with open(...)` block would be better (or just adding file.close() after you've loaded the words). – mdmjsh Jan 13 '22 at 11:21
  • Updated, thanks :) – Linux4Life531 Feb 20 '22 at 09:29
6

I find that there are 3 package-based solutions to this problem: pyenchant, wordnet, and a corpus (self-defined or from nltk). Pyenchant couldn't be installed easily on win64 with py3. Wordnet doesn't work very well because its corpus isn't complete. So for me, I chose the solution answered by @Sadik, and use `set(words.words())` to speed it up.

First:

pip3 install nltk
python3

import nltk
nltk.download('words')

Then:

from nltk.corpus import words
setofwords = set(words.words())

print("hello" in setofwords)
True
Tom M.
  • 105
  • 2
Young Yang
  • 89
  • 1
  • 4
  • Before checking your `input_word`, use `input_word.lower()` to convert it to lowercase. Only lowercase words seem to be present in nltk words list. – Bikash Gyawali Jan 19 '21 at 17:23
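Following the comment above about lowercasing, a small sketch of a case-insensitive lookup; the toy word list here merely stands in for nltk's `words.words()`, and with the real corpus you would build the set once, up front:

```python
# Normalise the corpus to lowercase once, then lowercase each query.
# The toy list below stands in for nltk's words.words().
setofwords = {w.lower() for w in ["Hello", "world", "Python"]}

def is_english_word(word):
    return word.lower() in setofwords

print(is_english_word("HELLO"))  # True
print(is_english_word("python")) # True
```

Note that lowercasing the corpus also folds proper nouns into ordinary words, which may or may not be what you want.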
3

With pyEnchant.checker SpellChecker:

from enchant.checker import SpellChecker

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return not (len(errors) > 4 or len(quote.split()) < 3)

print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))

> False
> True
grizmin
  • 141
  • 5
  • 1
    This will return true if the text is longer than 3 words and there are less than 4 errors (non-recognised words). In general for my use case those settings work pretty well. – grizmin Aug 02 '19 at 13:54
1

For a semantic web approach, you could run a SPARQL query against WordNet in RDF format. Basically, just use the urllib module to issue a GET request, return the results in JSON format, and parse them using Python's json module. If it's not an English word you'll get no results.

As another idea, you could query Wiktionary's API.
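A hedged sketch of the Wiktionary idea follows; the helper names are invented here, and it assumes the standard MediaWiki api.php endpoint, where nonexistent pages come back keyed as "-1" with a "missing" flag:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wiktionary.org/w/api.php"  # standard MediaWiki endpoint

def build_query_url(word):
    # action=query on a page title; missing pages are flagged in the reply
    return API + "?" + urlencode({
        "action": "query",
        "titles": word,
        "format": "json",
    })

def is_in_wiktionary(word):
    with urlopen(build_query_url(word)) as resp:
        pages = json.load(resp)["query"]["pages"]
    # MediaWiki keys nonexistent pages as "-1" and adds a "missing" flag
    return "-1" not in pages
```

One caveat: Wiktionary has entries for words in many languages, so a page existing does not by itself prove the word is English.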

Community
  • 1
  • 1
burkestar
  • 753
  • 1
  • 4
  • 12
0

Use nltk.corpus instead of enchant. Enchant gives ambiguous results. For example: for "benchmark" and "bench-mark", enchant returns True for both, when it should return False for "bench-mark".

Anand Kuamr
  • 1
  • 1
  • 1
0

you can see this page :

How to determine language

I recommend the langid

Mahdi Ebi
  • 5
  • 2
  • 1
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/late-answers/30061827) – RiveN Oct 12 '21 at 21:02
0

Download this txt file https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt

then create a set out of it using the following Python code snippet, which loads about 370k alphabetic English words

>>> with open("/PATH/TO/words_alpha.txt") as f:
>>>     words = set(f.read().split('\n'))
>>> len(words)
370106

From here onwards, you can check for existence in constant time using

>>> word_to_check = 'baboon'
>>> word_to_check in words
True

Note that this set might not be comprehensive but still gets the job done; users should do their own quality checks to make sure it works for their use-case as well.

Ayush
  • 403
  • 1
  • 8
  • 21