
In Google's paper YouTube-8M: A Large-Scale Video Classification Benchmark in section 4.1.2 Deep Bag of Frame (DBoF) Pooling second paragraph, the first sentence says:

The obtained sparse codes are fed into a pooling layer that aggregates the codes of the k frames into a single fixed-length video representation. We use max pooling to perform the aggregation.

I am a bit confused as to how the pooling layer would work. From my understanding, pooling layers always have a spatial component, e.g. a size and stride, because the features they operate on also have spatial components, e.g. the width/height of an image. But the sparse codes they talk about are created from the layer before the classification layer of an Inception network trained on ImageNet, meaning that each one is a 1024-dimensional deep feature, i.e. an embedding in a 1024-dimensional space.

Therefore, how is pooling supposed to work on an embedding? Or have I misunderstood what the paper is trying to say?

YellowPillow

1 Answer


They are simply max-pooling over time.

You are right that the features are a 1024-dimensional embedding of each frame, which means you have a K×1024 tensor representing the entire video, where K is the number of frames. For each of the 1024 feature dimensions, they simply take the maximum activation across all of the frames in the video.
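For a concrete picture, here is a minimal sketch in NumPy (the array sizes and variable names are made up for illustration, not taken from the paper's code) of max-pooling a K×1024 stack of frame embeddings down to a single 1024-dimensional video descriptor:

```python
import numpy as np

# Hypothetical example: K frames, each embedded as a 1024-d feature vector.
K, D = 300, 1024
frame_features = np.random.rand(K, D)  # shape (K, 1024), one row per frame

# Max-pool over time: for each of the 1024 feature dimensions, keep the
# largest activation seen in any frame of the video.
video_representation = frame_features.max(axis=0)  # shape (1024,)

print(video_representation.shape)  # (1024,)
```

There is no window size or stride involved: the "pool" simply spans the entire time axis, collapsing K frames into one vector.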

Does it seem overly simple? Maybe, but the comparison they make to bag-of-words models is apt: sometimes you are only interested in whether a frame-level feature appears anywhere in a video, and you don't care exactly where it appears.