I was reading the paper "Rethinking the Inception Architecture for Computer Vision" and near the beginning I came across the following part:
- Spatial aggregation can be done over lower dimensional embeddings without much or any loss in representational power. For example, before performing a more spread out (e.g. 3 × 3) convolution, one can reduce the dimension of the input representation before the spatial aggregation without expecting serious adverse effects. We hypothesize that the reason for that is the strong correlation between adjacent unit results in much less loss of information during dimension reduction, if the outputs are used in a spatial aggregation context. Given that these signals should be easily compressible, the dimension reduction even promotes faster learning.
Does this refer to the Inception module, where a 1x1 convolution is applied before the larger kernels such as 3x3 and the resulting feature maps are then concatenated?
And by lower dimensional embeddings, do they mean a smaller number of feature maps?
For example, the input to the 3x3 has 20 feature maps, and they apply a 1x1 convolution with 10 feature maps before applying that 3x3 kernel?
What does lower dimensional embedding mean in this context?
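To make my interpretation concrete, here is a minimal sketch of what I think is meant (assuming PyTorch; the channel counts 20 and 10 are just made up for illustration, not taken from the paper):

```python
import torch
import torch.nn as nn

# Input: 20 feature maps of spatial size 28x28 (illustrative numbers)
x = torch.randn(1, 20, 28, 28)

# 1x1 convolution: the hypothesized "dimension reduction" to fewer feature maps
reduce = nn.Conv2d(in_channels=20, out_channels=10, kernel_size=1)

# 3x3 convolution: the "spatial aggregation" over the reduced embedding
spatial = nn.Conv2d(in_channels=10, out_channels=10, kernel_size=3, padding=1)

out = spatial(reduce(x))
print(out.shape)  # torch.Size([1, 10, 28, 28])
```

Is this the kind of channel reduction before spatial aggregation that the quoted principle is describing?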