19

I use PyTorch to train a Hugging Face Transformers model, but every epoch it outputs the warning:

The current process just got forked. Disabling parallelism to avoid deadlocks... To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)

How can I disable this warning?

snowzjy

3 Answers

33

Set the environment variable to the string "false",

either by running

export TOKENIZERS_PARALLELISM=false

in your shell before launching the script,

or by putting

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

near the top of your Python script.
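For context, here is a minimal sketch of where the assignment typically goes when the tokenizer is used from forked DataLoader worker processes; the model name, dummy texts, and collate function are placeholders, not from the question:

```
import os

# Set this before any tokenization happens around forked DataLoader workers.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model


def collate(batch):
    # Tokenization happens here, inside the (possibly forked) worker processes.
    return tokenizer(list(batch), padding=True, return_tensors="pt")


if __name__ == "__main__":
    texts = ["a dummy sentence", "another dummy sentence"]  # placeholder data
    loader = DataLoader(texts, batch_size=2, num_workers=2, collate_fn=collate)
    for batch in loader:
        pass  # training step would go here
```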

Alec Segal
  • Worked for me. Still, worth having a look at [this answer](https://stackoverflow.com/a/67254879/3873799) that points out that using Fast Tokenizers may be the source of this, and that you may need to be wary of any consequences of using them. – alelom Aug 24 '21 at 11:15
10

I'm leaving this answer here for anyone wondering whether it is possible to keep the parallelism and save valuable time during training, and because this is the first Stack Overflow page that comes up when you search for the error on Google.

According to this comment on GitHub, the FastTokenizers seem to be the issue. According to another comment on gitmemory, you shouldn't use the tokenizer before forking the process (which basically means before iterating through your DataLoader with num_workers > 0).

So the solution is either to not use FastTokenizers before training/fine-tuning, or to use the normal ("slow") Tokenizers.

Check the Hugging Face documentation to find out whether you really need the FastTokenizer.
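If you decide you don't need it, you can request the slow (pure-Python) tokenizer explicitly. A minimal sketch, assuming AutoTokenizer and a placeholder model name:

```
from transformers import AutoTokenizer

# use_fast=False asks for the pure-Python ("slow") tokenizer, which does not
# use the Rust-level parallelism that triggers the fork warning.
slow_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

# The default returns a fast tokenizer when one is available for the model.
fast_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

print(type(slow_tok).__name__)   # e.g. BertTokenizer
print(type(fast_tok).__name__)   # e.g. BertTokenizerFast
```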

mingaflo
  • so does this warning message mean that the training/fine-tuning is not happening in a parallel manner? – Ritwik Aug 16 '21 at 19:08
  • According to my experience, yes – mingaflo Aug 17 '21 at 11:22
  • Not according to my experience. I ran two experiments: (a) one with this warning message (b) another without it. I just saved my dataloader from (a) and simply loaded it using ```torch.save()``` and ```torch.load()``` . Both experiments finished in approx same time (1 hour per epoch, for 3 epochs). – Ritwik Aug 23 '21 at 08:40
  • Example how to use the `FastTokenizers` after training and example using "normal" `Tokenizer`? – Alaa M. May 30 '22 at 13:50
  • Why do you want to use FastTokenizers after training? You should use them during training/inference. The docs tell you how to use "normal" Tokenizers. – mingaflo May 30 '22 at 14:24
2

I solved this problem by downgrading Hugging Face's transformers library from version 3.0.0 to 2.11.0, and the tokenizers library from 0.8.0rc4 to 0.7.0.

It seems to be a problem with version 0.8.0rc4 of Hugging Face's tokenizers library. Currently, there seems to be no way to set TOKENIZERS_PARALLELISM=(true | false) as the error message says.

Reference: https://github.com/ThilinaRajapakse/simpletransformers/issues/515
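For reference, pinning those exact versions would look something like this (assuming pip; the version numbers are the ones mentioned above):

```
pip install transformers==2.11.0 tokenizers==0.7.0
```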

han0ah