Following @hpaulj's hints, we can profile the vectorize function.
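For context, this excerpt assumes definitions along these lines from earlier in the answer (the actual Python function being wrapped is not shown here, so my_func is only a stand-in):

import numpy as np

def my_func(x):  # stand-in for the actual Python function being wrapped
    return x + 1

vectorize = np.vectorize(my_func)          # np.vectorize wrapper
frompyfunc = np.frompyfunc(my_func, 1, 1)  # ufunc with 1 input and 1 output

With that in place, the profiling runs: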
arr = np.linspace(0, 1, 10**7)
%load_ext line_profiler
%lprun -f np.vectorize._vectorize_call \
-f np.vectorize._get_ufunc_and_otypes \
-f np.vectorize.__call__ \
vectorize(arr)
which shows that 100% of the time is spent in _vectorize_call:
Timer unit: 1e-06 s
Total time: 3.53012 s
File: python3.7/site-packages/numpy/lib/function_base.py
Function: __call__ at line 2063
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2063 def __call__(self, *args, **kwargs):
...
2091 1 3530112.0 3530112.0 100.0 return self._vectorize_call(func=func, args=vargs)
...
Total time: 3.38001 s
File: python3.7/site-packages/numpy/lib/function_base.py
Function: _vectorize_call at line 2154
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2154 def _vectorize_call(self, func, args):
...
2161 1 85.0 85.0 0.0 ufunc, otypes = self._get_ufunc_and_otypes(func=func, args=args)
2162
2163 # Convert args to object arrays first
2164 1 1.0 1.0 0.0 inputs = [array(a, copy=False, subok=True, dtype=object)
2165 1 117686.0 117686.0 3.5 for a in args]
2166
2167 1 3089595.0 3089595.0 91.4 outputs = ufunc(*inputs)
2168
2169 1 4.0 4.0 0.0 if ufunc.nout == 1:
2170 1 172631.0 172631.0 5.1 res = array(outputs, copy=False, subok=True, dtype=otypes[0])
2171 else:
2172 res = tuple([array(x, copy=False, subok=True, dtype=t)
2173 for x, t in zip(outputs, otypes)])
2174 1 1.0 1.0 0.0 return res
This shows the part I had missed in my assumptions: the double array is converted to an object array in its entirety in a preprocessing step (which is not a very wise thing to do memory-wise). The other parts are similar for wrapped_frompyfunc:
Timer unit: 1e-06 s
Total time: 3.20055 s
File: <ipython-input-113-66680dac59af>
Function: wrapped_frompyfunc at line 16
Line # Hits Time Per Hit % Time Line Contents
==============================================================
16 def wrapped_frompyfunc(arr):
17 1 3014961.0 3014961.0 94.2 a = frompyfunc(arr)
18 1 185587.0 185587.0 5.8 b = a.astype(np.float64)
19 1 1.0 1.0 0.0 return b
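To make the memory cost of that object-array preprocessing step concrete, here is a small sketch (the numbers assume CPython on a 64-bit platform):

import sys
import numpy as np

arr = np.linspace(0, 1, 10**7)
obj = arr.astype(object)      # same effect as the dtype=object conversion above

print(arr.nbytes)             # 80000000: 8 bytes per float64
print(obj.nbytes)             # 80000000: 8 bytes per pointer ...
print(sys.getsizeof(obj[0]))  # 24: ... plus 24 bytes per boxed Python float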
When we look at peak memory consumption (e.g. via /usr/bin/time python script.py), we see that the vectorized version has twice the memory consumption of frompyfunc, which uses a more sophisticated strategy: the double array is processed in blocks of size NPY_BUFSIZE (which is 8192), so only 8192 Python floats (24 bytes each, plus an 8-byte pointer in the object array) are alive in memory at the same time, rather than one per element of the array, which can be far more. The cost of reserving the memory from the OS plus more cache misses is probably what leads to the higher running times.
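Here is a pure-Python sketch of that blocked strategy (an illustration only; the real implementation lives in C inside numpy, and blocked_apply is a hypothetical helper):

import numpy as np

NPY_BUFSIZE = 8192  # numpy's internal buffer size

def blocked_apply(func, arr, out_dtype=np.float64):
    # Process the input in fixed-size blocks, so that only about
    # NPY_BUFSIZE boxed Python objects are alive at any moment,
    # instead of one per array element.
    flat = arr.ravel()
    out = np.empty(flat.size, dtype=out_dtype)
    for start in range(0, flat.size, NPY_BUFSIZE):
        block = flat[start:start + NPY_BUFSIZE]
        out[start:start + NPY_BUFSIZE] = [func(x) for x in block]
    return out.reshape(arr.shape)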
My take-aways from it:
- the preprocessing step, which converts all inputs into object arrays, might not be needed at all, because frompyfunc has an even more sophisticated way of handling those conversions.
- neither vectorize nor frompyfunc should be used when the resulting ufunc is meant for "real code". Instead, one should either write it in C or use numba/similar (see the sketch after this list).
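As an illustration of the numba route, a minimal sketch (assuming numba is installed; my_func_nb is a hypothetical compiled version of the stand-in my_func from above):

import numba
import numpy as np

@numba.vectorize(["float64(float64)"])
def my_func_nb(x):
    return x + 1.0  # same stand-in body as my_func

arr = np.linspace(0, 1, 10**7)
res = my_func_nb(arr)  # compiled ufunc: operates on the raw double buffer

Unlike frompyfunc, the compiled ufunc never boxes the doubles into Python objects at all, on neither the input nor the output side.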
Calling frompyfunc on the object array takes less time than on the double array:
arr = np.linspace(0, 1, 10**7)
a = arr.astype(object)
%timeit frompyfunc(arr)  # 1.08 s ± 65.8 ms
%timeit frompyfunc(a)    # 876 ms ± 5.58 ms
However, the line-profiler timings above did not show any advantage for applying the ufunc to objects rather than to doubles: 3.089595 s vs. 3.014961 s. My suspicion is that this is due to more cache misses in the case where all objects are created up front, versus only 8192 freshly created objects (256 KB) being hot in the L2 cache.
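One way to probe this suspicion (a sketch only; results will vary by machine, and frompyfunc is the ufunc wrapper from above):

import timeit
import numpy as np

# If cache misses explain the gap, it should shrink (or reverse) once
# the whole array fits into the cache.
for n in (10**4, 10**7):
    arr = np.linspace(0, 1, n)
    a = arr.astype(object)
    t_double = min(timeit.repeat(lambda: frompyfunc(arr), number=3, repeat=3)) / 3
    t_object = min(timeit.repeat(lambda: frompyfunc(a), number=3, repeat=3)) / 3
    print(n, t_double, t_object)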