Since you specify CPU specifically, it might be worth considering pyFFTW together with Python's multiprocessing. In the past I have had some success splitting the array into sub-arrays, then launching concurrent processes for subsets of the sub-arrays, each of which in turn runs an instance of pyFFTW. pyFFTW is a Python wrapper over FFTW and in my experience has been faster than numpy's FFT. Do read the pyFFTW documentation on enabling caches etc. to optimize performance. One catch, though: too many processes can slow things down.
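As a minimal sketch of the cache point: enabling pyFFTW's interface cache keeps FFTW plans alive between calls so repeated transforms don't re-plan. The array size, thread count, and keepalive time here are arbitrary choices of mine, and the numpy fallback is only so the snippet runs without pyFFTW installed:

```python
import numpy as np

try:
    import pyfftw
    # Cache FFTW plans between calls; without this, every call re-plans.
    pyfftw.interfaces.cache.enable()
    pyfftw.interfaces.cache.set_keepalive_time(30)  # keep plans alive for 30 s
    fft = lambda a: pyfftw.interfaces.numpy_fft.fft(a, threads=4)
except ImportError:
    fft = np.fft.fft  # fallback so the sketch runs without pyFFTW

x = np.random.rand(4096)
spectrum = fft(x)  # numerically identical to np.fft.fft(x)
```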
Below is an example of what I've been on about. The fft_split function splits the problem (dat) into batches of num_prcss rows and engages multiple processes to handle each batch. I use pyFFTW here, but numpy's FFT can be used if you decide to run this. Play with the num_prcss variable to see the speed-up.
import numpy as np
import pyfftw
import multiprocessing as mp
import time

dat = np.random.rand(50000).reshape(50, -1)
num_prcss = 8
threads = 8
output = mp.Queue()

def fft_sub_process(dat_in, row, output):
    # Inverse FFT of one row; pyFFTW can also use several threads per transform.
    match_out = pyfftw.interfaces.numpy_fft.ifft(dat_in, threads=threads)
    match_abs = np.abs(match_out)
    max_match_fltr = np.max(match_abs)
    print([row, max_match_fltr])
    output.put([row, max_match_fltr])

def fft_split(dat, num_prcss):
    # Work through the rows of dat in batches of num_prcss concurrent processes.
    results = []
    n_rows = dat.shape[-2]
    for r in range(0, n_rows, num_prcss):
        batch = range(r, min(r + num_prcss, n_rows))  # last batch may be short
        processes = [mp.Process(target=fft_sub_process,
                                args=(dat[row], row, output)) for row in batch]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        results.append([output.get() for p in processes])
    return results

if __name__ == '__main__':  # required on platforms that spawn rather than fork
    t = time.time()
    search_res = fft_split(dat, num_prcss)
    elapsed = time.time() - t
    print(elapsed)
1. To compare speeds, use timeit. 2. scipy's convolution should be faster than numpy's; we spent a lot of time optimizing it (real FFT method, padding to 5-smooth lengths, using direct convolution when one input is much smaller, etc.). I don't know how it compares to tensorflow.
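A quick timeit sketch of that comparison; the signal and kernel sizes are arbitrary choices of mine, and with a kernel this large relative to the signal the FFT path should win clearly:

```python
import timeit
import numpy as np
from scipy.signal import fftconvolve

sig = np.random.rand(100_000)
kernel = np.random.rand(512)

# Same mathematical result, different algorithms: fftconvolve multiplies
# in the frequency domain, np.convolve does the O(n*m) direct sum.
fast = fftconvolve(sig, kernel, mode='full')
slow = np.convolve(sig, kernel, mode='full')

t_fft = timeit.timeit(lambda: fftconvolve(sig, kernel), number=3)
t_direct = timeit.timeit(lambda: np.convolve(sig, kernel), number=3)
print(t_fft, t_direct)
```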