I have been trying to optimise some code that takes an input signal in the form of a 2D array and convolves it with a Gaussian of a certain FWHM. The size of this 2D signal array is (671, 2001), and I am convolving along the second axis, the time axis, where
time = 0:0.5:1000
nt = length(time) # nt = 2001
I am convolving with a Gaussian of FWHM 150, corresponding to a standard deviation of about 63.7. I then pad the initial signal by 3 times the FWHM, setting the new values before the original start index to 0 and the new values after the original end index to the signal at the last time point.
Hence, nt = 3801 after padding.
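As a quick check on those numbers (a minimal sketch in Julia, using the time grid described above):
fwhm  = 150.0
sigma = fwhm / (2sqrt(2log(2)))   # ≈ 63.7, the standard deviation quoted above
dt    = 0.5
nt    = length(0:dt:1000)         # 2001 time points
npad  = Int(3fwhm / dt)           # 3*FWHM = 450 of padding = 900 samples per side
ntc   = nt + 2npad                # 3801 points after padding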
My MATLAB code calculates this convolution in 117.5043 seconds. This is done through a nested for loop over each of the axes, which should be slow in MATLAB. The main part of the MATLAB code is as follows,
for it = 1:nt
    % scan convolution mask across time
    kernal = preb*exp(-b*(tc-tc(it)).^2); % convolution mask - Gaussian
    N(it) = sum(kernal)*dt; % normalise
    kernal = kernal/N(it); % ensure convolution integrates to 1
    % calculate convolution for each row
    for iq = 1:Nq
        WWc(iq,it) = dt*sum((WWc0(iq,1:nt).').*kernal(1:nt));
    end
end
I have rewritten this in Julia, except using circulant matrices to construct a Gaussian kernel; this effectively removes the for loop over time, and the kernel to be applied over all time is now fully defined by one matrix. I then either loop over the rows and multiply this kernel by the signal of each row, or use the mapslices function. The main bit of code in Julia is as follows,
K = Circulant(G.Fx) # construct circulant matrix of gaussian function G.Fx
conv = zeros(nr, nt) # init convolved signal
#for i = 1:nr # If using a loop instead of mapslices
#    conv[i, :] = transpose(sC[i, :]) * K
#end
conv = mapslices(x -> K*x, sC; dims=2)
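As a toy illustration of the circulant idea (made-up numbers, nothing to do with the real signal): successive columns (and rows) of the kernel matrix are cyclic shifts of each other, so the one matrix holds the mask recentred at every time point.
using SpecialMatrices
g = [0.6, 0.3, 0.1, 0.0, 0.0]   # stand-in for a normalised Gaussian mask
K = Matrix(Circulant(g))        # 5x5 kernel; successive columns are cyclic shifts of g
x = rand(5)                     # one row of the signal
y = K * x                       # applies the recentred mask at every position in one go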
In Julia this convolution takes 368 seconds (almost 3 times as slow as MATLAB), despite using circulant matrices to skip a for loop and reducing the work for each row to multiplying two arrays of size (1, 3801) and (3801, 3801).
output of @time:
368.047741 seconds (19.37 G allocations: 288.664 GiB, 14.30% gc time, 0.03% compilation time)
Both are run on the same computer; should Julia not outperform MATLAB here?
UPDATE:
Playing around with the Julia code some more, I have rewritten the matrix multiplication in terms of loops, but it is still slow. The convolution is performed within a function, which calls one of two external constructors - Gaussian or Lorentzian - depending on an input flag. I also load SpecialMatrices.jl for the Circulant constructor. I am new to Julia and not sure whether any of that would make it slow. When using the @time macro on independent parts of the function, it is apparent that every part is really fast until it hits the loop at the end that actually calculates the convolution. This loop is incredibly slow whichever way I write it, be that matrix multiplication with the circulant kernel, a double nested loop combined with the sum() function, or a triple nested loop that does the sum itself.
using MAT
using SpecialMatrices
wd = " "
inputfile = matopen(wd*"/Signal.mat")
signal = read(inputfile, "pdW")
close(inputfile)
inputfile = matopen(wd*"/Vectors.mat")
qAng = read(inputfile, "qAng")
tvec = read(inputfile, "tt")
struct Gaussian
    Fx:: AbstractVector{Float64}
    sigma:: Float64
    height:: Float64
    fwhm:: Float64
    function Gaussian(fwhm:: Float64, x:: StepRangeLen{Float64}, x0:: Float64)
        sigma = fwhm/(2sqrt(2log(2)))
        height = 1/(sigma*sqrt(2pi)) # peak of a unit-area Gaussian
        Fx = height .* exp.( (-1 .* (x .- x0).^2)./ (2*sigma^2))
        new(Fx, sigma, height, fwhm)
    end
end
struct Lorentzian
    Fx:: AbstractVector{Float64}
    gamma:: Float64
    fwhm:: Float64
    function Lorentzian(fwhm:: Float64, x:: StepRangeLen{Float64}, x0:: Float64)
        gamma = fwhm/2
        Fx = 1 ./ ((pi * gamma) .* (1 .+ ((x .- x0)./ gamma).^2 ))
        new(Fx, gamma, fwhm)
    end
end
function convolve(S:: Array{Float64}, fwhm:: Float64, time:: Array{Float64}, type:: Int)
    nr:: Int64, nc:: Int64 = size(S)
    nt:: Int64 = length(time)
    dt:: Float64 = sum(diff(time, dims=2))/(length(time)-1)
    duration:: Float64 = fwhm*3 # must pad by three times the FWHM to deal with edges
    if dt != diff(time, dims=2)[1] ; error("Non-linear time range."); end
    if nr != nt && nc != nt; error("Inconsistent dimensions."); end
    if nc != nt; S = transpose(S); nr, nc = nc, nr; end # column always time axis
    tmin:: Float64 = minimum(time); tmax:: Float64 = maximum(time)
    tconv_min:: Float64 = tmin - duration ; tconv_max:: Float64 = tmax + duration
    tconv = tconv_min:dt:tconv_max # extended convolution time vector
    ntc:: Int64 = length(tconv)
    padend:: Int64 = (duration/dt) # index at which 0 padding ends and signal starts
    padstart:: Int64 = (padend + tmax / dt) # index where signal ends and padding starts
    Kv = type == 0 ? Gaussian(fwhm, tconv, tconv[1]) : Lorentzian(fwhm, tconv, tconv[1])
    println("Convolving signal with a $(typeof(Kv)) function with FWHM: $(Kv.fwhm) fs.")
    K:: Array{Float64} = Circulant(Kv.Fx) # construct kernel from circulant matrix
    sC = zeros(Float64, nr, ntc)
    sC[1:nr, 1:padend] .= S[1:nr, 1] # extend and pad signal
    sC[1:nr, padend:padstart] .= S
    sC[1:nr, padstart:end] .= S[1:nr, end]
    println("""
    Signal padded by 3*FWHM ( $(duration) fs ) forwards and backwards.
    Original time length: $(nt), Extended time: $(ntc), Diff: $(ntc-nt) steps.
    Padded signal size: $(size(sC)).
    Kernel size: $(size(K[1, :])).
    """)
    conv = zeros(Float64, nr, ntc)
    for t in 1:ntc
        for q in 1:nr
            #conv[q, t] = sum(sC[q, 1:ntc] .* K[t, :])*dt
            conv[q, t] = 0.0
            for j in 1:ntc
                conv[q, t] += sC[q, j] * K[t, j] * dt
            end
        end
    end
    return conv, tconv
end
fwhm = 100.0
conv, tconv = convolve(signal, fwhm, tvec, 0)
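For completeness, a more controlled way to time the full call is BenchmarkTools, interpolating the non-constant globals with $ so that looking them up is not part of the measurement (just a sketch, not part of the script above):
using BenchmarkTools
@btime convolve($signal, $fwhm, $tvec, 0);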
UPDATE 2: Working fast code, with a wrapper for convolving a random signal.
using LinearAlgebra
using SpecialMatrices
function wrapper()
    signal = rand(670, 2001)
    tvec = 0:0.5:1000
    fwhm = 150.0
    Flag = 0
    conv = convolve(signal, tvec, fwhm, Flag)
end
function Gaussian(x:: StepRangeLen{Float64}, x0:: Float64, fwhm:: T) where {T<:Union{Float64, Int64}}
    sigma = fwhm/(2sqrt(2log(2)))
    height = 1/(sigma*sqrt(2pi)) # peak of a unit-area Gaussian
    Fx = height .* exp.( (-1 .* (x .- x0).^2)./ (2*sigma^2))
    Fx = Fx./sum(Fx) # normalise so the discrete mask sums to 1
end
function Lorentzian(x:: StepRangeLen{Float64}, x0:: Float64, fwhm:: T) where {T<:Union{Float64, Int64}}
    gamma = fwhm/2
    Fx = 1 ./ ((pi * gamma) .* (1 .+ ((x .- x0)./ gamma).^2 ))
    Fx = Fx./sum(Fx)
end
function generate_kernel(tconv:: StepRangeLen{Float64}, fwhm:: Float64, centre:: Float64, ctype:: Int64)
    if ctype == 0
        K = Gaussian(tconv, centre, fwhm)
    elseif ctype == 1
        K = Lorentzian(tconv, centre, fwhm)
    else
        error("Only Gaussian and Lorentzian functions currently available.")
    end
    K = Matrix(Circulant(K))
    return K
end
function convolve(S:: Matrix{Float64}, time:: StepRangeLen{Float64}, fwhm:: Float64, ctype:: Int64)
    nr:: Int64, nc:: Int64 = size(S)
    dt:: Float64 = sum(diff(time, dims=1))/(length(time)-1)
    nt:: Int64 = length(time)
    duration:: Float64 = fwhm*3 # must pad by three times the FWHM to deal with edges
    tmin:: Float64 = minimum(time); tmax:: Float64 = maximum(time)
    if dt != diff(time, dims=1)[1] ; error("Non-linear time range."); end
    if nr != nt && nc != nt; error("Inconsistent dimensions."); end
    if nc != nt; S = copy(transpose(S)); nr, nc = nc, nr; end # column always time axis
    tconv_min:: Float64 = tmin - duration ; tconv_max:: Float64 = tmax + duration
    tconv = tconv_min:dt:tconv_max # extended convolution time vector
    ntc = length(tconv)
    padend:: Int64 = (duration/dt) # index at which 0 padding ends and signal starts
    padstart:: Int64 = (padend + tmax / dt) # index where signal ends and padding starts
    K = generate_kernel(tconv, fwhm, tconv[padend], ctype)
    sC = zeros(Float64, nr, ntc)
    sC[1:nr, 1:padend] .= @view S[1:nr, 1] # extend and pad signal
    sC[1:nr, padend:padstart] .= S
    sC[1:nr, padstart:end] .= @view S[1:nr, end]
    conv = zeros(Float64, nr, ntc)
    conv = convolution_integral(sC, K, dt)
    S .= conv[:, 1:nt] # return convolved signal with the same original size
    return S
end
function convolution_integral(signal:: Matrix{Float64}, kernel:: Matrix{Float64}, dt:: Float64)
    LinearAlgebra.BLAS.set_num_threads(Sys.CPU_THREADS)
    conv = signal * kernel
    conv .*= dt
    return conv
end
Using BenchmarkTools:
julia> @btime wrapper();
128.438 ms (11125 allocations: 300.24 MiB)
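As a small sanity check that the single matrix product in convolution_integral computes the same sums as an explicit loop (illustrative sizes, random stand-ins for the padded signal and kernel; assumes the functions above are already defined):
using SpecialMatrices
n  = 50
g  = rand(n); g ./= sum(g)    # stand-in for a normalised kernel vector
K  = Matrix(Circulant(g))
sC = rand(4, n)               # stand-in for a padded signal
dt = 0.5
fast = convolution_integral(sC, K, dt)
slow = [dt * sum(sC[q, j] * K[j, t] for j in 1:n) for q in 1:4, t in 1:n]
@assert maximum(abs.(fast .- slow)) < 1e-10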
There is some strange slicing behaviour with the padded signal, in the sense that the padding at early times flips to the end of the time extension after convolution. Something to do with the centering of the Gaussian kernel, I think. For now I just slice the array back to its original dimensions, without the padding, and get the result I expect.