I would like to access nltk.corpus.wordnet in a multithreaded environment. As soon as I enable multithreading, methods such as synsets() fail. If I disable it, everything works fine.

The error messages vary from run to run. For example, the following assertion failure looks very much like a race condition to me:

File "/home/lhk/anaconda3/envs/dlab/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py", line 1342, in synset_from_pos_and_offset
    assert synset._offset == offset

There are other questions about this:

The solution to the first linked question was to load the corpus before the program branches into individual threads. I've done that: wordnet.ensure_loaded() is called before any threads are started.
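
For reference, a minimal sketch of that arrangement, assuming a thread pool from concurrent.futures. The names preload_then_map, count_synsets and demo are hypothetical and only illustrate the order of operations; as noted, this ordering alone did not remove the errors in my case.

```python
from concurrent.futures import ThreadPoolExecutor


def preload_then_map(preload, worker, items, max_workers=4):
    """Call preload() once in the main thread, then run worker over items in a pool."""
    preload()  # e.g. wordnet.ensure_loaded, executed before any thread exists
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))


def count_synsets(word):
    # Hypothetical worker; needs nltk and the downloaded WordNet data.
    from nltk.corpus import wordnet
    return len(wordnet.synsets(word))


def demo():
    # Not called automatically; run it only where nltk and WordNet are installed.
    from nltk.corpus import wordnet
    return preload_then_map(wordnet.ensure_loaded, count_synsets, ["dog", "cat", "tree"])
```

Calling demo() loads the corpus in the main thread and then queries it from the worker threads.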

The recommendation in the GitHub issue is to import wordnet within my threaded function. But that doesn't change anything.

lhk

1 Answer

A workaround is to make a deep copy of the corpus for every thread. Of course, this needs a lot of memory and is not very efficient:

import copy
from nltk.corpus import wordnet as wn
wn.ensure_loaded()

# at the beginning of the multi-threaded environment
my_wn = copy.deepcopy(wn)
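
As an alternative to per-thread copies, one could try serializing all access to the corpus with a single lock. This is only a sketch, under the assumption that the failures come from threads sharing the reader's internal state (such as open file handles); I have not verified that against the NLTK source. The names _WORDNET_LOCK, synset_names, names_for_words and the lookup parameter are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One process-wide lock guarding every call into the shared WordNet reader.
_WORDNET_LOCK = threading.Lock()


def synset_names(word, lookup=None):
    """Return the synset names for word, holding the lock for the whole lookup.

    lookup is a hypothetical injection point; it defaults to wordnet.synsets.
    """
    with _WORDNET_LOCK:
        if lookup is None:
            # Imported lazily; needs nltk and the downloaded WordNet data.
            from nltk.corpus import wordnet
            lookup = wordnet.synsets
        # Convert to plain strings while still holding the lock, so that no
        # further corpus reads are triggered later from unlocked code.
        return [synset.name() for synset in lookup(word)]


def names_for_words(words, lookup=None, max_workers=4):
    """Look up several words from a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda w: synset_names(w, lookup), words))
```

The lock removes any parallelism inside the lookups themselves, so this only pays off when the threads spend most of their time on other work. Any further call on a returned Synset that reads from the corpus would have to go through the same lock.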
lhk
    This solution appears to generate `LookupError` for other corpus datasets that are not downloaded. I have not been able to make `deepcopy(wn)` work successfully in any case. – ely Jul 26 '19 at 13:06