Since you specify CPU specifically, it might be worth considering pyFFTW together with Python's multiprocessing. In the past I have had some success splitting the array into sub-arrays, then launching concurrent processes for subsets of the sub-arrays, each of which in turn runs an instance of pyFFTW. pyFFTW is a Python wrapper over FFTW and in my experience has been faster than numpy's FFT. Do read the pyFFTW documentation on enabling caches etc. to optimize performance. One catch, though: too many processes can slow things down.
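As a minimal sketch of the cache point: enabling pyFFTW's interface cache keeps FFTW plans alive between calls so repeated transforms don't re-plan. The array size, thread count, and keepalive time here are arbitrary choices of mine, and the numpy fallback is only so the snippet runs without pyFFTW installed:

```python
import numpy as np

try:
    import pyfftw
    # Cache FFTW plans between calls; without this, every call re-plans.
    pyfftw.interfaces.cache.enable()
    pyfftw.interfaces.cache.set_keepalive_time(30)  # keep plans alive for 30 s
    fft = lambda a: pyfftw.interfaces.numpy_fft.fft(a, threads=4)
except ImportError:
    fft = np.fft.fft  # fallback so the sketch runs without pyFFTW

x = np.random.rand(4096)
spectrum = fft(x)  # numerically identical to np.fft.fft(x)
```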
Below is an example of what I've been on about. The fft_split function splits the problem (dat) into batches of num_prcss rows and engages multiple processes to handle each batch. I use pyFFTW here, but numpy's FFT can be used if you decide to run this. Play with the num_prcss variable to see the speed-up.
import numpy as np
import pyfftw
import multiprocessing as mp
import time

dat = np.random.rand(50000).reshape(50, -1)
num_prcss = 8
threads = 8
output = mp.Queue()

def fft_sub_process(dat_in, row, output):
    # Inverse FFT of one row; pyFFTW can also use several threads per transform.
    match_out = pyfftw.interfaces.numpy_fft.ifft(dat_in, threads=threads)
    match_abs = np.abs(match_out)
    max_match_fltr = np.max(match_abs)
    print([row, max_match_fltr])
    output.put([row, max_match_fltr])

def fft_split(dat, num_prcss):
    # Work through the rows of dat in batches of num_prcss concurrent processes.
    results = []
    n_rows = dat.shape[-2]
    for r in range(0, n_rows, num_prcss):
        batch = range(r, min(r + num_prcss, n_rows))  # last batch may be short
        processes = [mp.Process(target=fft_sub_process,
                                args=(dat[row], row, output)) for row in batch]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        results.append([output.get() for p in processes])
    return results

if __name__ == '__main__':  # required on platforms that spawn rather than fork
    t = time.time()
    search_res = fft_split(dat, num_prcss)
    elapsed = time.time() - t
    print(elapsed)
1. To compare speeds, use timeit. 2. scipy's convolution should be faster than numpy's; we spent a lot of time optimizing it (real FFT method, padding to 5-smooth lengths, using direct convolution when one input is much smaller, etc.). I don't know how it compares to tensorflow.
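A quick timeit sketch of that comparison; the signal and kernel sizes are arbitrary choices of mine, and with a kernel this large relative to the signal the FFT path should win clearly:

```python
import timeit
import numpy as np
from scipy.signal import fftconvolve

sig = np.random.rand(100_000)
kernel = np.random.rand(512)

# Same mathematical result, different algorithms: fftconvolve multiplies
# in the frequency domain, np.convolve does the O(n*m) direct sum.
fast = fftconvolve(sig, kernel, mode='full')
slow = np.convolve(sig, kernel, mode='full')

t_fft = timeit.timeit(lambda: fftconvolve(sig, kernel), number=3)
t_direct = timeit.timeit(lambda: np.convolve(sig, kernel), number=3)
print(t_fft, t_direct)
```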