
I want to shuffle values in a 3D numpy-array, but only when they are > 0.

When I run my function with a single core, it is much faster than with even 2 cores. The slowdown is far beyond the overhead of creating new Python processes. What am I missing?

The following code outputs:

random shuffling of markers started
time in serial execution:                          1.0288s
time executing in parallel with num_cores=1:       0.9056s
time executing in parallel with num_cores=2:     273.5253s

import numpy as np
import time
from random import shuffle
from joblib import Parallel, delayed
import multiprocessing

def randomizeVoxels(V, markerLUT):
    # remaps every positive value of the 2-D slice V through the markerLUT
    V_rand = V.copy()
    # ix/iy below index the 2-D slice, so the naming does not match the
    # outer ix/iy/iz convention used on the full volume
    for ix in range(V.shape[0]):
        for iy in range(V.shape[1]):
            if V[ix,iy]>0:
                V_rand[ix,iy]=markerLUT[V[ix,iy]]

    return V_rand

V_ori=np.arange(1000000,-1000000,-1).reshape(100,100,200)

V_rand=V_ori.copy()

listMarkers=np.unique(V_ori)
listMarkers=[val for val in listMarkers if val>0]

print("random shuffling of markers started\n")

reassignedMarkers=listMarkers.copy()
#random shuffling of original markers
shuffle(reassignedMarkers)

markerLUT={}
for i,iMark in enumerate(listMarkers):
    markerLUT[iMark]=reassignedMarkers[i]

tic=time.perf_counter()

for ix in range(len(V_ori)):
    for iy in range(len(V_ori[0])):
        for iz in range(len(V_ori[0][0])):
            if V_ori[ix,iy,iz]>0:
                V_rand[ix,iy,iz]=markerLUT[V_ori[ix,iy,iz]]

toc=time.perf_counter()

print("time in serial execution: \t\t\t{: >4.4f} s".format(toc-tic))

#######################################################################

num_cores = 1

V_rand=V_ori.copy()

tic=time.perf_counter()

results = Parallel(n_jobs=num_cores)(
    delayed(randomizeVoxels)(V_ori[imSlice, :, :], markerLUT)
    for imSlice in range(V_ori.shape[0]))

for i,resTuple in enumerate(results):
    V_rand[i,:,:]=resTuple

toc=time.perf_counter() 

print("time executing in parallel with num_cores={}:\t{: >4.4f} s".format(num_cores,toc-tic))    

num_cores = 2

V_rand=V_ori.copy()

MASK = "time executing in parallel with num_cores={}:\t {: >4.4f}s"

tic=time.perf_counter() #----------------------------- [PERF-me]

results = Parallel(n_jobs=num_cores)(
    delayed(randomizeVoxels)(V_ori[imSlice, :, :], markerLUT)
    for imSlice in range(V_ori.shape[0]))

for i,resTuple in enumerate(results):
    V_rand[i,:,:]=resTuple

toc=time.perf_counter() #----------------------------- [PERF-me]

print( MASK.format(num_cores,toc-tic) )

1 Answer

Q : "What am I missing?"

Most probably the memory-I/O bottlenecks.


The numpy part of the processing is pretty shallow here (the shuffle does not compute anything, it merely moves data between a pair of locations), so most of the time there is not enough useful work in flight to let re-ordered CPU-core instructions mask the memory-I/O latencies (ref. the latency costs of straight and cross-QPI memory-I/O ops at the lowest levels of contemporary super-scalar CISC architectures, whose highly speculative branch prediction is of no use for memory-I/O-bound, non-branching, tightly crafted sections, and of multi-core and many-core NUMA designs).

This is most probably why even the first spin-off concurrent process does not pay off, no matter whether it is forced to camp on the same CPU-core (sharing that core's time with its sibling in an interleaved, two-step dance, both processes again memory-I/O-bound, with even worse chances for latency masking on the shared memory-I/O channels) or on any other CPU-core (adding cross-QPI latency costs whenever non-local memory-I/O has to be performed, again worsening the chances for memory-I/O latency masking).

CPU-core hopping, enforced by the colliding effects of the CPU-clock Boost policy (which, once it starts to violate the Thermal Management limits, moves the process to camp on the next, colder CPU-core), invalidates all CPU-core cache benefits: the data already pre-cached into the fastest L1-data cache is not available on that next, colder core and has to be re-fetched in full (for array objects with larger memory footprints perhaps even via a cross-QPI fetch), so harnessing more cores does not have a trivial effect on the resulting efficiency.
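
If one wants to check how much the core-hopping effect contributes, a quick experiment is to pin the interpreter to a single CPU core and re-run the timings (a minimal sketch, assuming a Linux or Windows box with the third-party psutil package installed; the choice of core 0 is arbitrary):

import psutil                    # third-party, pip install psutil

p = psutil.Process()             # handle to the current Python process
print("allowed cores before:", p.cpu_affinity())
p.cpu_affinity([0])              # restrict scheduling to CPU core 0 only
print("allowed cores after: ", p.cpu_affinity())
# ...now re-run the timing sections above and compare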


;o)
The numpy high-performance, smart processing is not the one to be blamed here; on the very contrary, it clearly unmasks the CPU "starvation" state, known for ages as The Very Performance Ceiling for all our modern CPUs. This is why we see so many many-core CPUs, which try to circumvent this bottleneck by having more and more cores (see the silicon-level latency costs noted above).

Last but not least
the code as-is contains an immense number of opportunities to improve its performance, a numpy-smart, vectorised remapping (avoiding the range()-based loops) being the first one to name; a minimal sketch of that route follows below. More tips could follow, yet all of them will finally bang their head against the very same trouble: the CPU-starvation ceiling.
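
A minimal vectorised sketch of that route (reusing V_ori, listMarkers and reassignedMarkers from the question; the flat lut array and the V_rand_vec name are introduced here only for the demo):

import numpy as np

# flat lookup table: index == original marker value, entry == shuffled marker
lut = np.arange(V_ori.max() + 1)
lut[np.asarray(listMarkers)] = np.asarray(reassignedMarkers)

# remap all positive voxels in one vectorised gather, no Python-level loops
mask = V_ori > 0
V_rand_vec = V_ori.copy()
V_rand_vec[mask] = lut[V_ori[mask]]

Even then, per the reasoning above, the operation stays memory-bound; the vectorisation only removes the Python-interpreter overhead, not the memory-I/O ceiling.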
