I read this paper by Gatys et al. and cannot understand why the Gram matrix of the style feature maps is necessary. The paper says:
On top of the CNN responses in each layer of the network we built a style representation that computes the correlations between the different filter responses, where the expectation is taken over the spatial extend of the input image.
My reading is that different channels respond to different high-level features, so the correlations between filters, collected in a single matrix, might serve as a combined representation of style. But why does the Gram matrix (or, loosely, the correlation matrix) work in the style transfer model? I also wonder why we don't apply the same mechanism to the content feature map.
My guess is that the Gram matrix is applied to both the input image and the style image, while the content image is responsible only for early-stage shaping. Am I right?
Can this be explained mathematically?
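To make the question concrete, here is a minimal sketch of what I understand the style representation to compute: for a layer with C filters whose responses are flattened to N spatial positions each, G[i][j] = sum over k of F[i][k] * F[j][k]. The function name, the toy values, and the permutation are my own assumptions for illustration, not taken from the paper's code.

```python
def gram_matrix(features):
    """Gram matrix of a feature map.

    features: list of C channels, each a list of N activations
    (one layer's filter responses, flattened over height * width).
    Returns a C x C matrix with G[i][j] = sum_k F[i][k] * F[j][k].
    """
    return [
        [sum(a * b for a, b in zip(fi, fj)) for fj in features]
        for fi in features
    ]

# Toy feature map: 2 channels, 4 spatial positions (hypothetical values).
features = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
]
G = gram_matrix(features)

# Move the spatial positions around, using the same permutation in every channel.
perm = [2, 0, 3, 1]
shuffled = [[channel[k] for k in perm] for channel in features]
G_shuffled = gram_matrix(shuffled)

print(G)           # channel-by-channel inner products
print(G_shuffled)  # same matrix: the sum over positions ignores their order
```

In this sketch the Gram matrix is unchanged when the spatial positions are shuffled, because the sum runs over all positions regardless of order. The content representation, as far as I can tell, compares the feature maps directly and so does depend on where each response occurs. Is that difference the reason the Gram matrix is used for style only?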