Following @hpaulj's hints, we can profile the vectorize function.
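For context, this excerpt assumes definitions along these lines from earlier in the answer (the actual Python function being wrapped is not shown here, so my_func is only a stand-in):

import numpy as np

def my_func(x):  # stand-in for the actual Python function being wrapped
    return x + 1

vectorize = np.vectorize(my_func)          # np.vectorize wrapper
frompyfunc = np.frompyfunc(my_func, 1, 1)  # ufunc with 1 input and 1 output

With that in place, the profiling runs: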
arr = np.linspace(0, 1, 10**7)
%load_ext line_profiler
%lprun -f np.vectorize._vectorize_call \
-f np.vectorize._get_ufunc_and_otypes \
-f np.vectorize.__call__ \
vectorize(arr)
which shows that 100% of the time is spent in _vectorize_call:
Timer unit: 1e-06 s
Total time: 3.53012 s
File: python3.7/site-packages/numpy/lib/function_base.py
Function: __call__ at line 2063
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2063 def __call__(self, *args, **kwargs):
...
2091 1 3530112.0 3530112.0 100.0 return self._vectorize_call(func=func, args=vargs)
...
Total time: 3.38001 s
File: python3.7/site-packages/numpy/lib/function_base.py
Function: _vectorize_call at line 2154
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2154 def _vectorize_call(self, func, args):
...
2161 1 85.0 85.0 0.0 ufunc, otypes = self._get_ufunc_and_otypes(func=func, args=args)
2162
2163 # Convert args to object arrays first
2164 1 1.0 1.0 0.0 inputs = [array(a, copy=False, subok=True, dtype=object)
2165 1 117686.0 117686.0 3.5 for a in args]
2166
2167 1 3089595.0 3089595.0 91.4 outputs = ufunc(*inputs)
2168
2169 1 4.0 4.0 0.0 if ufunc.nout == 1:
2170 1 172631.0 172631.0 5.1 res = array(outputs, copy=False, subok=True, dtype=otypes[0])
2171 else:
2172 res = tuple([array(x, copy=False, subok=True, dtype=t)
2173 for x, t in zip(outputs, otypes)])
2174 1 1.0 1.0 0.0 return res
This shows the part I had missed in my assumptions: the double array is converted to an object array in its entirety in a preprocessing step (which is not a very wise thing to do memory-wise). The other parts are similar for wrapped_frompyfunc:
Timer unit: 1e-06 s
Total time: 3.20055 s
File: <ipython-input-113-66680dac59af>
Function: wrapped_frompyfunc at line 16
Line # Hits Time Per Hit % Time Line Contents
==============================================================
16 def wrapped_frompyfunc(arr):
17 1 3014961.0 3014961.0 94.2 a = frompyfunc(arr)
18 1 185587.0 185587.0 5.8 b = a.astype(np.float64)
19 1 1.0 1.0 0.0 return b
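To make the memory cost of that object-array preprocessing step concrete, here is a small sketch (the numbers assume CPython on a 64-bit platform):

import sys
import numpy as np

arr = np.linspace(0, 1, 10**7)
obj = arr.astype(object)      # same effect as the dtype=object conversion above

print(arr.nbytes)             # 80000000: 8 bytes per float64
print(obj.nbytes)             # 80000000: 8 bytes per pointer ...
print(sys.getsizeof(obj[0]))  # 24: ... plus 24 bytes per boxed Python float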
When we look at peak memory consumption (e.g. via /usr/bin/time python script.py), we see that the vectorized version has twice the memory consumption of frompyfunc, which uses a more sophisticated strategy: the double array is processed in blocks of size NPY_BUFSIZE (which is 8192), so only 8192 Python floats (24 bytes each, plus an 8-byte pointer in the object array) are alive in memory at the same time, rather than one per element of the array, which can be far more. The cost of reserving the memory from the OS plus more cache misses is probably what leads to the higher running times.
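Here is a pure-Python sketch of that blocked strategy (an illustration only; the real implementation lives in C inside numpy, and blocked_apply is a hypothetical helper):

import numpy as np

NPY_BUFSIZE = 8192  # numpy's internal buffer size

def blocked_apply(func, arr, out_dtype=np.float64):
    # Process the input in fixed-size blocks, so that only about
    # NPY_BUFSIZE boxed Python objects are alive at any moment,
    # instead of one per array element.
    flat = arr.ravel()
    out = np.empty(flat.size, dtype=out_dtype)
    for start in range(0, flat.size, NPY_BUFSIZE):
        block = flat[start:start + NPY_BUFSIZE]
        out[start:start + NPY_BUFSIZE] = [func(x) for x in block]
    return out.reshape(arr.shape)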
My take-aways from it:
- the preprocessing step, which converts all inputs into object arrays, might not be needed at all, because frompyfunc has an even more sophisticated way of handling those conversions.
- neither vectorize nor frompyfunc should be used when the resulting ufunc is meant for "real code". Instead, one should either write it in C or use numba/similar (see the sketch after this list).
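As an illustration of the numba route, a minimal sketch (assuming numba is installed; my_func_nb is a hypothetical compiled version of the stand-in my_func from above):

import numba
import numpy as np

@numba.vectorize(["float64(float64)"])
def my_func_nb(x):
    return x + 1.0  # same stand-in body as my_func

arr = np.linspace(0, 1, 10**7)
res = my_func_nb(arr)  # compiled ufunc: operates on the raw double buffer

Unlike frompyfunc, the compiled ufunc never boxes the doubles into Python objects at all, on neither the input nor the output side.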
Calling frompyfunc on the object array takes less time than on the double array:
arr = np.linspace(0, 1, 10**7)
a = arr.astype(object)
%timeit frompyfunc(arr)  # 1.08 s ± 65.8 ms
%timeit frompyfunc(a)    # 876 ms ± 5.58 ms
However, the line-profiler timings above did not show any advantage for applying the ufunc to objects rather than to doubles: 3.089595 s vs. 3.014961 s. My suspicion is that this is due to more cache misses in the case where all objects are created up front, versus only 8192 freshly created objects (256 KB) being hot in the L2 cache.
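One way to probe this suspicion (a sketch only; results will vary by machine, and frompyfunc is the ufunc wrapper from above):

import timeit
import numpy as np

# If cache misses explain the gap, it should shrink (or reverse) once
# the whole array fits into the cache.
for n in (10**4, 10**7):
    arr = np.linspace(0, 1, n)
    a = arr.astype(object)
    t_double = min(timeit.repeat(lambda: frompyfunc(arr), number=3, repeat=3)) / 3
    t_object = min(timeit.repeat(lambda: frompyfunc(a), number=3, repeat=3)) / 3
    print(n, t_double, t_object)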