I was reading the paper "Rethinking the Inception Architecture for Computer Vision" and near the beginning I came across the following part:
- Spatial aggregation can be done over lower dimensional embeddings without much or any loss in representational power. For example, before performing a more spread out (e.g. 3 × 3) convolution, one can reduce the dimension of the input representation before the spatial aggregation without expecting serious adverse effects. We hypothesize that the reason for that is the strong correlation between adjacent unit results in much less loss of information during dimension reduction, if the outputs are used in a spatial aggregation context. Given that these signals should be easily compressible, the dimension reduction even promotes faster learning.
Does this refer to the Inception module, where a 1x1 convolution is applied before the larger kernels such as 3x3 and the resulting feature maps are then concatenated?
And by lower dimensional embeddings, do they mean a smaller number of feature maps?
For example, the input to the 3x3 has 20 feature maps, and they apply a 1x1 convolution with 10 feature maps before applying that 3x3 kernel?
What does lower dimensional embedding mean in this context?
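To make my interpretation concrete, here is a minimal sketch of what I think is meant (assuming PyTorch; the channel counts 20 and 10 are just made up for illustration, not taken from the paper):

```python
import torch
import torch.nn as nn

# Input: 20 feature maps of spatial size 28x28 (illustrative numbers)
x = torch.randn(1, 20, 28, 28)

# 1x1 convolution: the hypothesized "dimension reduction" to fewer feature maps
reduce = nn.Conv2d(in_channels=20, out_channels=10, kernel_size=1)

# 3x3 convolution: the "spatial aggregation" over the reduced embedding
spatial = nn.Conv2d(in_channels=10, out_channels=10, kernel_size=3, padding=1)

out = spatial(reduce(x))
print(out.shape)  # torch.Size([1, 10, 28, 28])
```

Is this the kind of channel reduction before spatial aggregation that the quoted principle is describing?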