Most scikit-learn estimators take an n_jobs parameter (on the constructor and/or in fit/predict) for creating parallel jobs using joblib.
I noticed that setting it to -1 creates just one Python process that maxes out the cores, with CPU usage hitting ~2500% in top.
This is quite different from setting it to a positive integer > 1, which creates multiple Python processes, each at ~100% usage.
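For concreteness, the calls I'm making look roughly like this (an illustrative sketch; the estimator choice and data sizes are placeholders, not my exact setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder dataset; my real one is much larger
X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

# n_jobs=-1: a single Python process shows up, CPU usage around 2500% in top
clf_all = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf_all.fit(X, y)

# n_jobs=8: multiple Python processes show up, each at roughly 100%
clf_eight = RandomForestClassifier(n_estimators=500, n_jobs=8)
clf_eight.fit(X, y)
```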
How does the value of n_jobs affect CPU and core usage on a multi-CPU Linux server?
(e.g. with n_jobs=8, are 8 CPUs fully locked up, or do they still leave some capacity free for other tasks/processes?)
Additionally, I occasionally get a MemoryError when setting n_jobs=-1 on large datasets.
However, memory usage usually hovers at around 30-40% for the single Python process.
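I'm mostly eyeballing this in top; a small psutil check (a hypothetical sketch, not part of my actual pipeline) shows the same picture:

```python
import os
import psutil

# Check memory of the fitting process and any worker children it spawned;
# this is only to illustrate what I mean by "~30-40% for the single process".
proc = psutil.Process(os.getpid())
print(f"main process RSS: {proc.memory_percent():.1f}% of total RAM")
for child in proc.children(recursive=True):
    print(f"worker {child.pid} RSS: {child.memory_percent():.1f}%")
```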
How are the data and memory managed/copied depending on the value of n_jobs?