
In convolutional neural networks (CNNs), the matrix of weights at each step has its rows and columns flipped to obtain the kernel matrix before proceeding with the convolution. This is explained in a series of videos by Hugo Larochelle here:

Computing the hidden maps would correspond to doing a discrete convolution with a channel from the previous layer, using a kernel matrix [...], and that kernel is computed from the hidden weights matrix $W_{ij}$, where we flip the rows and the columns.


If we were to compare the reduced number of operations in a convolution to regular matrix multiplication, as used in other types of NN, computational expediency would be a clear explanation. However, this may not be the most pertinent comparison...

In digital image processing, the application of a convolution filter to an image (this is a great YouTube video for practical intuition) seems related to:

  1. The fact that convolution is associative, while (cross-)correlation is not (see the quick numerical check after this list).
  2. The possibility of applying filters in the frequency domain of the image as multiplications, since convolution in the time domain is equivalent to multiplication in the frequency domain (the convolution theorem).
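
A quick numerical check of the first point, sketched with NumPy/SciPy (the array contents and sizes here are arbitrary):

```python
import numpy as np
from scipy.signal import convolve, correlate

rng = np.random.default_rng(0)
a, b, c = (rng.standard_normal(5) for _ in range(3))

# Convolution is associative ...
conv_left = convolve(convolve(a, b), c)
conv_right = convolve(a, convolve(b, c))
print(np.allclose(conv_left, conv_right))   # True

# ... cross-correlation, in general, is not.
corr_left = correlate(correlate(a, b), c)
corr_right = correlate(a, correlate(b, c))
print(np.allclose(corr_left, corr_right))   # False for a generic (asymmetric) c
```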

In this particular technical environment of DSP, (cross-)correlation is defined as:

$$F\circ I(x,y)=\sum_{j=-N}^{N}\sum_{i=-N}^N\, F(i,j)\,I(x+i, y+j)$$

which is essentially the sum of all the cells in a Hadamard product:

$$\small F\circ I(x,y)=\Tiny\begin{bmatrix}F[-N,-N]\,I[x-N,y-N]&\cdots&F[-N,0]\,I[x-N,y]&\cdots& F[-N,N]\,I[x-N,y+N]\\ \vdots&\ddots&\vdots&\ddots&\vdots\\ F[0,-N]\,I[x,y-N]&\cdots&F[0,0]\,I[x,y]&\cdots& F[0,N]\,I[x,y+N]\\ \vdots&\ddots&\vdots&\ddots&\vdots\\ F[N,-N]\,I[x+N,y-N]&\cdots&F[N,0]\,I[x+N,y]&\cdots& F[N,N]\,I[x+N,y+N]\\ \end{bmatrix}$$

where $F(i,j)$ is a filter function (expressed as a matrix), and $I(x,y)$ is the pixel value of the image at location $(x,y)$.

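A minimal sketch of this windowed sum in NumPy (the helper name is hypothetical; the filter is assumed to be $(2N+1)\times(2N+1)$ and only positions where the window fits entirely inside the image are kept):

```python
import numpy as np

def cross_correlate(F, I):
    """Valid-mode cross-correlation: sum of the Hadamard product of the
    filter with each (2N+1)x(2N+1) patch of the image."""
    N = F.shape[0] // 2                     # filter has shape (2N+1, 2N+1)
    H, W = I.shape
    out = np.zeros((H - 2 * N, W - 2 * N))
    for x in range(N, H - N):
        for y in range(N, W - N):
            patch = I[x - N:x + N + 1, y - N:y + N + 1]
            out[x - N, y - N] = np.sum(F * patch)   # elementwise product, then sum
    return out
```

For these valid positions the result agrees with `scipy.signal.correlate2d(I, F, mode='valid')`.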

The objective of cross-correlation is to assess how similar a probe image is to a test image. The calculation of a cross-correlation map typically relies on the convolution theorem.
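
For instance, the cross-correlation map used for template matching can be computed through the convolution theorem by convolving the image with the flipped template (a sketch assuming SciPy; `fftconvolve` performs the FFT, the pointwise product, and the inverse FFT internally):

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
image = rng.standard_normal((128, 128))      # stand-in for the test image
template = rng.standard_normal((15, 15))     # stand-in for the probe

# correlation(image, template) == convolution(image, flipped template)
corr_map = fftconvolve(image, template[::-1, ::-1], mode='same')

peak = np.unravel_index(np.argmax(corr_map), corr_map.shape)
print(peak)  # location of the largest (unnormalized) correlation
```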


On the other hand, convolution is defined as:

$$F* I(x,y)=\sum_{j=-N}^{N}\sum_{i=-N}^N\, F(i,j)\,I(x-i, y-j)$$

which is the same as a correlation operation with the rows and columns of the filter flipped (and, when the filter is symmetric, identical to the correlation itself):

$$\small F* I(x,y)=\Tiny\begin{bmatrix}F[N,N]\,I[x-N,y-N]&\cdots&F[N,0]\,I[x-N,y]&\cdots& F[N,-N]\,I[x-N,y+N]\\ \vdots&\ddots&\vdots&\ddots&\vdots\\ F[0,N]\,I[x,y-N]&\cdots&F[0,0]\,I[x,y]&\cdots& F[0,-N]\,I[x,y+N]\\ \vdots&\ddots&\vdots&\ddots&\vdots\\ F[-N,N]\,I[x+N,y-N]&\cdots&F[-N,0]\,I[x+N,y]&\cdots& F[-N,-N]\,I[x+N,y+N]\\ \end{bmatrix}$$

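This equivalence is easy to verify numerically (a sketch assuming SciPy; the kernel is deliberately asymmetric so that the flip actually matters):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
I = rng.standard_normal((9, 9))
F = rng.standard_normal((3, 3))          # asymmetric filter

conv = convolve2d(I, F, mode='valid')
corr_flipped = correlate2d(I, F[::-1, ::-1], mode='valid')

print(np.allclose(conv, corr_flipped))   # True: convolution == correlation with a flipped kernel
```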


In DSP, convolution is used to apply filters to an image (e.g. smoothing, sharpening). For example, after convolving a portrait of Joseph Fourier with the $5\times5$ Gaussian filter $\small\begin{bmatrix} 1&4&7&4&1\\ 4&16&26&16&4\\ 7&26&41&26&7\\ 4&16&26&16&4\\ 1&4&7&4&1\end{bmatrix}$ (normalized by $1/273$ so the entries sum to one), the edges of his face become fuzzier.

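A sketch of that blur with SciPy (any 2-D grayscale array will do as the image; the kernel is the one shown above, normalized so its entries sum to one):

```python
import numpy as np
from scipy.signal import convolve2d

# 5x5 Gaussian approximation from the text; the raw entries sum to 273.
G = np.array([[1,  4,  7,  4, 1],
              [4, 16, 26, 16, 4],
              [7, 26, 41, 26, 7],
              [4, 16, 26, 16, 4],
              [1,  4,  7,  4, 1]], dtype=float) / 273.0

def gaussian_blur(image):
    """Smooth a 2-D grayscale image; 'same' keeps the original size."""
    # Note: G is symmetric, so convolution and correlation coincide here.
    return convolve2d(image, G, mode='same', boundary='symm')

# usage (hypothetical array): blurred = gaussian_blur(grayscale_portrait)
```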


Computationally, both operations amount to a Frobenius inner product between the filter and an image patch, i.e. the sum of their elementwise product, which equals the trace of a matrix product, $\operatorname{tr}(A^\top B)$.
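
A quick NumPy check of that identity, $\sum_{i,j} A_{ij}B_{ij}=\operatorname{tr}(A^\top B)$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))   # filter
B = rng.standard_normal((5, 5))   # image patch

frobenius = np.sum(A * B)         # sum of the Hadamard product
trace_form = np.trace(A.T @ B)    # trace of a matrix product

print(np.allclose(frobenius, trace_form))  # True
```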


Questions (reformulating after comments and first answer):

  1. Is the use of convolutions in CNNs linked to FFTs?

From what I gather so far, the answer is no. FFTs have been used to speed up GPU implementations of convolutions, but FFTs are not usually part of the structure or the activation functions of CNNs, despite the use of convolutions in the pre-activation steps.

  2. Are convolution and cross-correlation in CNNs equivalent?

Yes, they are equivalent.

  3. If it is as simple as "there is no difference", what is the point of flipping the weights into the kernel matrix?

Neither the associativity of convolution (useful in mathematical proofs) nor any considerations regarding FTs and the convolution theorem seem applicable. In fact, it appears that the flipping doesn't even take place in practice (cross-correlation being simply mislabeled as convolution) (?).

  • Antoni, can you point to any source where they flip the weights? As @hossein pointed out, you can do anything with cross-correlation that you can do with convolution, just by flipping the ordering. So all this frequency-domain stuff is irrelevant. – seanv507 Mar 27 '17 at 06:13
  • @seanv507 I edited my OP to include the source for what I believe you are asking me. I do get that convolution is the same operation as cross-correlation with a flipped filter matrix, but I don't get why we go over the "pain" of the discrete convolution if there isn't anything to it that couldn't be accomplished with correlation. The answer given is clearly knowledgeable, but it could be fitted into a comment, and does not explain the reason behind two distinct operations (is it a "tradition" from DSP carried over to ML?), the implications as to the learning process, and the relation to FT's. – Antoni Parellada Mar 27 '17 at 12:23
  • I'm not seeing mention of frequency domain speed-up or translation invariance. Those are part of it. – EngrStudent Mar 27 '17 at 15:11
  • @EngrStudent I'm not sure whether you mean that you are missing this in my already prohibitively long OP, or on the links, but the OP is far from a rhetorical question, and the terse answer provided is clearly accurate given the upvotes, yet it does very little to usher the non-initiated. So I am looking forward to an answer (not necessarily a dissertation) that touches on the issues brought up in the OP. – Antoni Parellada Mar 27 '17 at 15:16
  • @EngrStudent For instance, this is a great answer that almost addresses my OP: the animation may be a cross-correlation (I don't know if the matrix is flipped), and the term "convolution" may be lax (?) lexicon. It may be that it makes no difference if FFTs are really never applied in CNNs (as opposed to DSP), and then flipping the rows and columns is an unnecessary step, brought up just to illustrate the equivalence of cross-correlation and convolution in CNNs. If these points were included in an answer, I'd be happy to accept it. – Antoni Parellada Mar 27 '17 at 15:44
  • Antoni, there is no need to flip. It's more of a convention; in DSP people talk about convolution rather than cross-correlation, and "cross-correlational neural networks" doesn't roll off the tongue. But the more natural (for humans to interpret) operation for a CNN is cross-correlation (you are template matching; consider e.g. a vertical edge filter rather than a rotationally symmetric one). So I think Hugo Larochelle is maybe talking about numerical libraries where convolution, rather than cross-correlation, is the standard function (he is effectively saying cross-correlation can be done by a convolution). – seanv507 Mar 27 '17 at 20:54
  • @seanv507 Thank you! Your comment is the missing piece! I presume that the three, provisionally self-answered questions at the end of the post are correct, aren't they? I want to encourage you again to post a formal answer. Your comment, again, was the most useful piece of information. – Antoni Parellada Mar 27 '17 at 20:57
  • @AntoniParellada I added some more points to my answer. I hope this is now addressing your final question. – Hossein Mar 27 '17 at 21:00
  • @Hossein Yes, it makes a huge difference. Thank you. Regarding the edited OP, and the specific question 1, "Is the use of convolutions in CNN linked to FFT?", can you comment a bit on whether the answer I presume is correct is indeed correct? – Antoni Parellada Mar 27 '17 at 21:03
  • Antoni, I agree on the first two questions/answers, and my comment was answering the third question. – seanv507 Mar 27 '17 at 21:04
  • @AntoniParellada Yes, your first answer is correct. To be clearer: in classic machine vision there was a strong link between convolution and FFT (to interpret high-pass/low-pass filters), but this link is not necessary for CNNs. – Hossein Mar 27 '17 at 21:06
  • While there isn't necessarily a direct link between CNNs and FFTs, there is evidence that you can obtain a speedup of CNNs by using FFTs while retaining the usual convolution sum. See here for example: https://arxiv.org/pdf/1312.5851.pdf – Alex R. Mar 27 '17 at 21:34
  • IMO, convolution is nature's best solution to path-finding in a higher-dimensional learning space within a 3-D geometry (i.e. the brain). So, yes, expediency is the reason. – Marcos Aug 12 '17 at 15:47

2 Answers


There is no difference in what neural networks can do when they use convolution or correlation. This is because the filters are learned: if a CNN can learn to do a particular task using the convolution operation, it can also learn to do the same task using the correlation operation (it would simply learn the flipped, i.e. 180°-rotated, version of each filter).

For more details on why people sometimes find it more intuitive to think in terms of convolution rather than correlation, this post may be useful.

There remains the question: if there is no difference between convolution and cross-correlation, what is the point of flipping the weights into the kernel matrix? To answer it, I would like to quote a few sentences from the Deep Learning book by Ian Goodfellow et al.:

The only reason to flip the kernel is to obtain the commutative property. While the commutative property is useful for writing proofs, it is not usually an important property of a neural network implementation... Many machine learning libraries implement cross-correlation but call it convolution.

The takeaway is that although convolution is the favored operation in classic machine vision applications, it is replaced by correlation in many implementations of convolutional neural networks.
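
That last point is easy to check numerically, for example with PyTorch (a sketch assuming PyTorch and SciPy are installed; single channel, stride 1, no padding): the output of `torch.nn.functional.conv2d` matches SciPy's cross-correlation, not its true convolution.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.signal import correlate2d, convolve2d

rng = np.random.default_rng(0)
img = rng.standard_normal((7, 7)).astype(np.float32)
ker = rng.standard_normal((3, 3)).astype(np.float32)

# PyTorch's "convolution" does not flip the kernel.
out = F.conv2d(torch.from_numpy(img)[None, None],
               torch.from_numpy(ker)[None, None])[0, 0].numpy()

print(np.allclose(out, correlate2d(img, ker, mode='valid'), atol=1e-5))  # True
print(np.allclose(out, convolve2d(img, ker, mode='valid'), atol=1e-5))   # False
```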

Hossein
  • Thank you. I read with attention the blog you link to, and it seems as though the use of convolution is not simply equivalent to correlation, and does respond to frequency-domain feature selection. I am looking for an answer elaborating on this. – Antoni Parellada Mar 26 '17 at 10:19
  • As far as I know, they are equivalent in what they can do, since both take a dot product of two matrices, but convolution flips the filter matrix before the dot product, and since CNNs learn the filters, they can learn the flipped filters. – Hossein Mar 26 '17 at 11:00
  • +1 to Hossein's explanation, but -1 for the blog link. The blog is mainly focused on hardware, and he is a CS guy with no background in convolution and other signal-processing concepts. – seanv507 Mar 27 '17 at 06:03
  • I would like to still insist on having some additional paragraph on the relationship (or lack thereof) between convolution in CNNs and Fourier transforms. – Antoni Parellada Mar 28 '17 at 21:04

There's a practical reason for the link between FFTs and convolution.

Convolution is slow in the time/image domain. Applying an $n \times n$ filter to one pixel requires $O(n^2)$ multiplications and additions. Applying it to every pixel in an $N \times N$ image thus requires $n^2N^2$ operations. This grows quickly, and the large number of operations not only requires extra time, but also introduces more numerical error.

The Convolution Theorem says that convolution in the time domain is equivalent to pointwise multiplication in the frequency domain. FFTs are fast: they have good asymptotic performance, $O(N^2 \log N)$ for an $N \times N$ image, and the actual implementations are often highly optimized. Switching to the Fourier domain thus lets you perform the convolution in $O(N^2 \log N)$ time (dominated by the FFTs themselves; the pointwise multiplication is only $O(N^2)$), instead of $O(n^2N^2)$. This can provide a considerable speedup, even though the FFT -> multiplication -> inverse FFT route seems much more complicated. More here
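
A sketch of this route with SciPy (`fftconvolve` handles the FFT → pointwise multiplication → inverse FFT internally), showing that it matches direct convolution up to floating-point error:

```python
import numpy as np
from scipy.signal import convolve2d, fftconvolve

rng = np.random.default_rng(0)
image = rng.standard_normal((256, 256))   # N x N
kernel = rng.standard_normal((31, 31))    # n x n

direct = convolve2d(image, kernel, mode='same')    # ~ n^2 N^2 operations
via_fft = fftconvolve(image, kernel, mode='same')  # ~ N^2 log N operations

print(np.allclose(direct, via_fft))  # True, up to floating-point error
```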

Matt Krause