I use NLTK with WordNet in my project. I installed everything manually on my PC: `pip3 install nltk --user` in a terminal, then `nltk.download()` in a Python shell to download WordNet.

I want to automate these steps with a setup.py file, but I don't know a good way to install WordNet.

For now, I have this piece of code after the call to setup() ("nltk" is in the install_requires list passed to setup()):

import sys
if 'install' in sys.argv:
    import nltk
    nltk.download("wordnet")

Is there a better way to do this?

Arne
Tom Cornebize
  • @martin-thoma from a quick glance, looks like the _nltk data_ dependencies could be packaged as Python projects and distributed on PyPI without too much work. The whole thing could be relatively easily scripted and delegated to a CI/CD system. You should weigh in on these tickets: https://github.com/nltk/nltk_data/issues/12 https://github.com/nltk/nltk/issues/2228 – sinoroc Oct 12 '19 at 15:11
  • @martin-thoma also, here is a rather similar post I wrote about the same problem with spacy: https://stackoverflow.com/questions/57773454/package-spacy-model/57782864#57782864 does that apply to your situation as well? – Arne Oct 14 '19 at 07:13
  • For my use case, the best option seemed to be to list all dependencies in a `requirements.txt` file and use `pip install -r requirements.txt` first. Then in my `setup.py` I have the manual download command `nltk.download("punkt")` which is used when I run `pip install -e .` I believe this works because I'm building a Docker image/container, not trying to distribute a package. – rkechols Jan 28 '22 at 23:46

2 Answers

I managed to install the NLTK data in setup.py by overriding cmdclass with my own Install class:

from setuptools import setup, find_packages
from setuptools.command.install import install as _install


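class Install(_install):
    def run(self):
        # Install the package and its dependencies (including nltk) first.
        _install.do_egg_install(self)
        # Import nltk only now, once it has been installed.
        import nltk
        nltk.download("popular")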

setup(...
    cmdclass={'install': Install},
    ...
    install_requires=[
      'nltk',
      ],
    setup_requires=['nltk']
    ...
   )

It is important to call do_egg_install() in your run() method so that nltk is installed before import nltk is executed (see also the question "python setuptools install_requires is ignored when overriding cmdclass"). Also, don't forget to add nltk to setup_requires.

alvas
asmaier

You can also automate the installation with a shell script, for example by running the following after installing nltk with pip:

python -m nltk.downloader -d /usr/share/nltk_data wordnet
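
If you would rather drive the same command from Python (for instance from a build or provisioning script), a minimal sketch could look like the following. The helper names `build_download_command` and `download_nltk_data` are hypothetical and not part of NLTK; the sketch assumes nltk is already installed for the interpreter that runs it, and that the target directory is writable.

```python
import subprocess
import sys


def build_download_command(packages, download_dir=None):
    """Build the argv list for `python -m nltk.downloader`.

    Hypothetical helper: it only assembles the command shown above,
    using the current interpreter instead of whatever `python` is on PATH.
    """
    cmd = [sys.executable, "-m", "nltk.downloader"]
    if download_dir is not None:
        # -d selects the directory the data is downloaded into.
        cmd += ["-d", download_dir]
    cmd += list(packages)
    return cmd


def download_nltk_data(packages, download_dir=None):
    """Run the downloader; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_download_command(packages, download_dir), check=True)
```

Calling `download_nltk_data(["wordnet"], "/usr/share/nltk_data")` would then run the equivalent of the shell command above. Using `sys.executable` is a design choice so that the data is fetched with the same interpreter (and virtual environment) in which nltk was installed.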
transcranial