Use nltk.corpus multithreaded

Question

I would like to access nltk.corpus.wordnet in a multithreaded environment. As soon as I enable multithreading, methods such as synsets() fail. If I disable it, everything works fine.

The error messages vary from run to run. For example, one failure looks like this, which strongly suggests a race condition:

File "/home/lhk/anaconda3/envs/dlab/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py", line 1342, in synset_from_pos_and_offset
    assert synset._offset == offset

There are other questions about this:

The problem here was also caused by multithreading: What would cause WordNetCorpusReader to have no attribute LazyCorpusLoader?
This question has a more general title but seems to describe the same problem (multithreaded corpus loading fails): Python NLTK multi threading
There is an issue about this: https://github.com/nltk/nltk/issues/1576

The solution to the first linked question was to load the corpus before the program branches out into individual threads. I've done that: wordnet.ensure_loaded() is called before any threads are started.
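For reference, the "load first, then branch" pattern described above can be sketched roughly as follows. The helper name `run_after_preload` and the worker signature are hypothetical, chosen only for illustration; `ensure_loaded()` is the call named in the question. As noted, this step alone did not make the errors go away here.

```python
import threading


def run_after_preload(corpus, worker, n_threads=4):
    """Force the corpus to load in the main thread, then start the workers.

    `corpus` is expected to expose ensure_loaded(), as nltk.corpus.wordnet does.
    `worker` is a hypothetical callable taking (corpus, thread_index).
    """
    # Trigger the lazy load once, before any worker thread exists.
    corpus.ensure_loaded()

    threads = [
        threading.Thread(target=worker, args=(corpus, i))
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Possible usage (requires nltk and the WordNet data to be installed):
#   from nltk.corpus import wordnet as wn
#   run_after_preload(wn, lambda corpus, i: corpus.synsets("dog"))
```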

The recommendation in the GitHub issue is to import wordnet inside the threaded function, but that makes no difference.

score 0 · Answer 1 · answered Nov 18 '18 at 10:26

A workaround is to make a deep copy of the corpus for every thread. This uses a lot of memory and is not very efficient:

import copy
from nltk.corpus import wordnet as wn
wn.ensure_loaded()

# inside each thread, before its first WordNet call:
# take a private copy so that no reader state is shared between threads
my_wn = copy.deepcopy(wn)
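One way to keep exactly one copy per thread, rather than copying on every call, might be to cache the copy in `threading.local`. This is only a sketch of that idea: the helper `private_copy` is a hypothetical name, it handles a single shared object, and it assumes that `copy.deepcopy` succeeds on whatever is passed in, which may not hold for every corpus object.

```python
import copy
import threading

# Per-thread storage: each thread sees its own attributes on this object.
_local = threading.local()


def private_copy(shared):
    """Return the calling thread's own deep copy of `shared`.

    The copy is made on the first call in each thread and reused afterwards,
    so the (potentially expensive) deepcopy runs once per thread.
    """
    if not hasattr(_local, "obj"):
        _local.obj = copy.deepcopy(shared)
    return _local.obj


# Possible usage inside a worker function (requires nltk and WordNet data):
#   from nltk.corpus import wordnet as wn
#   my_wn = private_copy(wn)
#   my_wn.synsets("dog")
```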

lhk


This solution appears to generate `LookupError` for other corpus datasets that are not downloaded. I have not been able to make `deepcopy(wn)` work successfully in any case. – ely Jul 26 '19 at 13:06
