The idea of applying filters to do something like identify edges is pretty cool.
For example, you can take an image of a 7. With a few different filters, you end up with transformed images that each emphasize a different characteristic of the original. The original 7:

[image: the original 7]

can be experienced by the network as:

[images: four filtered versions of the 7]

Notice how each filtered image has extracted a different edge of the original 7.
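To make the filtering step concrete, here is a minimal sketch of what I mean, using hand-picked Sobel-style kernels and scipy's convolve2d (a real CNN learns its kernels during training rather than using hand-picked ones, and the toy "7" here is made up for illustration):

```python
import numpy as np
from scipy.signal import convolve2d

# A toy 25x25 grayscale "7" (1 = ink, 0 = background).
img = np.zeros((25, 25))
img[3, 3:22] = 1.0            # horizontal top stroke
for i in range(4, 22):        # diagonal descending stroke
    img[i, 24 - i] = 1.0

# Hand-picked Sobel-style edge kernels (a CNN learns these instead).
horiz_edges = np.array([[-1, -2, -1],
                        [ 0,  0,  0],
                        [ 1,  2,  1]])
vert_edges = horiz_edges.T

# Each convolution yields a feature map emphasizing one edge direction.
h_map = convolve2d(img, horiz_edges, mode='same')
v_map = convolve2d(img, vert_edges, mode='same')
```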
This is all great, but then suppose the next layer in your network is a Max Pooling layer.

My question is: generally, doesn't this seem a little bit like overkill? We were just very careful and deliberate about identifying edges using filters -- and now we no longer care about any of that, since we've blasted the hell out of the pixel values! Please correct me if I'm wrong, but we went from 25 x 25 to 2 x 2! Why not just go straight to Max Pooling, then? Won't we end up with basically the same thing?
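Here is a sketch of the extreme pooling I am describing, assuming the 2 x 2 output comes from taking the max over each quadrant of the 25 x 25 feature map (quadrant_max_pool is a made-up helper for illustration; in practice pooling windows are usually small, e.g. 2 x 2 with stride 2):

```python
import numpy as np

def quadrant_max_pool(fm):
    """Max over each quadrant of a 2-D feature map -> 2x2 output.
    With an odd size like 25, the quadrants are 12 and 13 wide."""
    hh, hw = fm.shape[0] // 2, fm.shape[1] // 2
    return np.array([[fm[:hh, :hw].max(), fm[:hh, hw:].max()],
                     [fm[hh:, :hw].max(), fm[hh:, hw:].max()]])

fm = np.random.rand(25, 25)      # stand-in for one filtered feature map
pooled = quadrant_max_pool(fm)   # 25x25 -> 2x2
print(pooled.shape)              # (2, 2)
```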
As an extension to my question, I can't help but wonder what would happen if, coincidentally, all 4 squares just happened to contain a pixel with the same max value. Surely this isn't a rare case, right? Suddenly all your training images would look exactly the same.
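To show the kind of collision I am worried about, here is a toy case (reusing the made-up quadrant_max_pool helper from the sketch above) where two quite different feature maps pool to the identical 2 x 2 output:

```python
import numpy as np

# Same quadrant-pooling helper as in the previous sketch.
def quadrant_max_pool(fm):
    hh, hw = fm.shape[0] // 2, fm.shape[1] // 2
    return np.array([[fm[:hh, :hw].max(), fm[:hh, hw:].max()],
                     [fm[hh:, :hw].max(), fm[hh:, hw:].max()]])

a = np.zeros((25, 25))
b = np.zeros((25, 25))
a[2, 2] = a[2, 20] = a[20, 2] = a[20, 20] = 1.0      # bright corners
b[10, 10] = b[10, 14] = b[14, 10] = b[14, 14] = 1.0  # bright center cluster

# Each quadrant's max is 1.0 in both maps, so the pooled outputs collide.
assert np.array_equal(quadrant_max_pool(a), quadrant_max_pool(b))
```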


The pooling operation provides a form of translation invariance? – SmallChess Sep 21 '16 at 09:40