This question is about composing (aka stacking, repeating, successively applying) convolution layers with no nonlinearity between them. I first encountered this idea in Justin Johnson's Stanford/UMich computer vision slides. To motivate the inclusion of nonlinearities in convolutional neural networks, he writes:

> Q: What happens if we stack two convolution layers? A: We get another convolution! Solution: Add activation function between conv layers.
For a year I took this on faith. After all, it was obvious that we needed nonlinearities in feedforward neural nets. Otherwise we'd have a composition of affine (or, with no bias, linear) transformations, which was itself an affine (or linear) transformation; what was done with two layers could be done with one. Indeed, $x \mapsto W_2 (W_1 x + b_1) + b_2$ is the same as $x \mapsto \tilde{W} x + \tilde{b}$ for $\tilde{W} = W_2 W_1$ and $\tilde{b} = W_2 b_1 + b_2$, so we might as well just learn the parameters $\tilde{W}$ and $\tilde{b}$ for a single fully-connected layer.
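To convince myself, here's a minimal NumPy sketch of that collapse (the shapes and the names `W1`, `b1`, `W2`, `b2`, `x` are just illustrative):

```python
import numpy as np

# Illustrative shapes: two stacked affine layers vs. one collapsed layer.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x + b1) + b2          # x -> W2(W1 x + b1) + b2
W_tilde, b_tilde = W2 @ W1, W2 @ b1 + b2      # collapsed parameters
one_layer = W_tilde @ x + b_tilde             # x -> W_tilde x + b_tilde

print(np.allclose(two_layers, one_layer))     # True
```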
But for convolutions, I'm having trouble finding an analogous algebraic demonstration that repeated layers are redundant. Is there one akin to the $x \mapsto \tilde{W} x + \tilde{b}$ reduction above? Or just any compelling explanation, algebraic or not, besides what I have below?
Here's my best explanation thus far. After reading these two posts, I realize that applying two convolutional layers to an image $I$ is a linear map from $\mathbb{R}^{C \times H \times W}$ to $\mathbb{R}^{F \times H' \times W'}$, so this portion of the CNN can't express any function that isn't linear.
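For what it's worth, I can also check the claim numerically in a stripped-down single-channel setting. This sketch uses SciPy's `convolve2d` (true convolution rather than the cross-correlation most deep learning libraries implement, and it ignores channels, biases, stride, and padding); the names `I`, `k1`, `k2` are just illustrative. It shows that two stacked convolutions match a single convolution whose kernel is the full convolution $k_1 * k_2$:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
I = rng.standard_normal((8, 8))    # toy single-channel "image"
k1 = rng.standard_normal((3, 3))   # first conv kernel
k2 = rng.standard_normal((3, 3))   # second conv kernel

# Two convolution layers applied in sequence, no activation in between.
stacked = convolve2d(convolve2d(I, k1, mode='valid'), k2, mode='valid')

# One convolution layer with the combined kernel k1 * k2.
combined_kernel = convolve2d(k1, k2, mode='full')
single = convolve2d(I, combined_kernel, mode='valid')

print(np.allclose(stacked, single))  # True
```

So even in code the two layers collapse into one, which is exactly the redundancy I'd like to see written out algebraically.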