In Google's paper YouTube-8M: A Large-Scale Video Classification
Benchmark, section 4.1.2 (Deep Bag of Frame (DBoF) Pooling), the second paragraph begins with:
The obtained sparse codes are fed into a pooling layer that aggregates the codes of the k frames into a single fixed-length video representation. We use max pooling to perform the aggregation.
I am a bit confused as to how this pooling layer would work. From my understanding, pooling layers always have a spatial component, e.g. a window size and stride, because the features they operate on also have spatial dimensions, e.g. the width/height of an image. But the sparse codes they talk about are created from the layer before the classification layer of an Inception network trained on ImageNet, meaning they are 1024-dimensional deep features, or embeddings in a 1024-dimensional space.
Therefore, how is pooling supposed to work on an embedding? Or have I misunderstood what the paper is trying to say?
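For concreteness, here is my current guess at what they might mean (a minimal NumPy sketch, assuming k sampled frames and treating the sparse codes as the 1024-dimensional frame features): is the "max pooling" simply an element-wise max taken across the frame axis, with no window or stride at all?

import numpy as np

# Hypothetical sketch of my interpretation, not the paper's actual code:
# "max pooling" as an element-wise max over the k frame-level vectors,
# so there is no spatial window or stride involved.
k, d = 30, 1024                       # k sampled frames, 1024-dim feature per frame (assumed numbers)
frame_codes = np.random.rand(k, d)    # stand-in for the k sparse codes / frame embeddings

video_repr = frame_codes.max(axis=0)  # max of each dimension across the k frames
print(video_repr.shape)               # (1024,) -- a single fixed-length video representation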