Why do we need the Gram matrix in style transfer learning?

Question

I read this paper by Gatys and cannot understand why the Gram matrix of the style feature maps is necessary. The paper says:

On top of the CNN responses in each layer of the network we built a style representation that computes the correlations between the different filter responses, where the expectation is taken over the spatial extend of the input image.

My reading is that different channels respond to different high-level features, so the correlations between filters, collected in the correlation matrix, may act as a combined style representation. Why does the Gram matrix (or, say, the correlation matrix) work in the style transfer model? I also wonder why we don't apply the same mechanism to the content map.

I guess it is because we apply the Gram matrix to both the input image and the style image, while the content image is responsible only for early-stage shaping. Am I right?

Can this be explained in the language of mathematics?

Lerner Zhang · Answer 1 · 2022-12-08T03:32:49.133

The only difference between the content loss and the style loss is that the style loss computes the Gram matrix of the feature maps before taking the MSE, whereas the content loss applies the MSE directly to the feature maps.
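To put this in mathematical terms, here is a sketch of the two losses in the notation I recall from the Gatys paper. Treat the symbols as assumptions: $F^l$ is the feature matrix of the generated image at layer $l$, with $N_l$ filters (rows) and $M_l$ spatial positions (columns), $P^l$ is the feature matrix of the content image, and $A^l$ is the Gram matrix of the style image. The normalization constants vary between implementations.

```latex
% Content loss: compares features position by position (index k).
\mathcal{L}_{\text{content}} = \frac{1}{2} \sum_{i,k} \left( F^l_{ik} - P^l_{ik} \right)^2

% Gram matrix: the sum over k collapses the spatial dimension.
G^l_{ij} = \sum_{k=1}^{M_l} F^l_{ik} F^l_{jk}

% Style loss at layer l: compares filter correlations only.
E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2
```

The index $k$ (the position) survives in the content loss but is summed out in $G^l_{ij}$, so the style loss only sees which filters fire together, not where they fire.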

The Gram matrix, which is what distinguishes the style loss from the content loss, does two things: 1) it ignores the positions of the features/styles, because each entry sums over all spatial locations; 2) each entry is a correlation between two filters, so the score is high when two style features co-occur. Minimizing the gap between the correlation scores of the style feature maps and those of the input feature maps therefore aligns the style correlations as closely as possible.
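Both points can be checked numerically. The following is a minimal NumPy sketch, not the paper's implementation: `gram_matrix` is a hypothetical helper, the feature maps are toy arrays rather than real CNN activations, and no normalization is applied.

```python
import numpy as np

def gram_matrix(features):
    """Unnormalized Gram matrix of a (channels, height, width) feature map."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)  # one row per channel
    return flat @ flat.T               # (c, c): sums products over all positions

# Point 1: the Gram matrix ignores spatial positions.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4, 4))  # toy feature map: 3 channels, 4x4 positions

# Rearrange the positions (here: reverse their order) identically in every channel.
rearranged = feats.reshape(3, -1)[:, ::-1].reshape(3, 4, 4)

content_mse = np.mean((feats - rearranged) ** 2)
style_mse = np.mean((gram_matrix(feats) - gram_matrix(rearranged)) ** 2)
print(f"content MSE: {content_mse:.4f}")  # clearly non-zero
print(f"style MSE:   {style_mse:.2e}")    # zero up to floating-point error

# Point 2: co-occurring features give a high score.
# Channels 0 and 1 fire at the same positions; channel 2 fires elsewhere.
toy = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]).reshape(3, 2, 2)
toy_gram = gram_matrix(toy)
print(toy_gram)
# [[2. 2. 0.]
#  [2. 2. 0.]
#  [0. 0. 2.]]
```

Rearranging the positions changes the content loss but leaves the Gram matrix untouched, and the off-diagonal entry for the two co-firing channels is large while the entry for channels that never fire together is zero.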

