I am confused about a seemingly trivial point regarding BatchNorm. It is widely illustrated with graphics showing that BatchNorm corrects the distribution of incoming values. To calculate the mean, does it average over the batch for each neuron separately, or does it average over both dimensions (batch & layer width)?
For example, if I zero out some of the neurons after BatchNorm, does this spoil the output distribution, or not, since each neuron already has its own zero-centered distribution? (Assuming the trainable scale and shift parameters are not used.)
In other words, are the mean and variance unique to the whole layer, or do we calculate a mean & variance for each individual neuron in a layer?
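
To make the question concrete, here is a minimal NumPy sketch of the two alternatives I have in mind (the array shape, variable names, and epsilon are illustrative assumptions only, not a claim about how any particular library implements it):

```python
import numpy as np

# Hypothetical activations for one layer: 4 examples in the batch, 3 neurons.
x = np.random.randn(4, 3)
eps = 1e-5  # small constant for numerical stability, value chosen arbitrarily

# Alternative 1: statistics per neuron, averaged over the batch axis only,
# so each of the 3 neurons gets its own mean and variance.
mean_per_neuron = x.mean(axis=0)   # shape (3,)
var_per_neuron = x.var(axis=0)     # shape (3,)
x_norm_1 = (x - mean_per_neuron) / np.sqrt(var_per_neuron + eps)

# Alternative 2: a single mean and variance shared by the whole layer,
# averaged over both the batch and the neuron axes.
mean_layer = x.mean()              # scalar
var_layer = x.var()                # scalar
x_norm_2 = (x - mean_layer) / np.sqrt(var_layer + eps)
```

Which of `x_norm_1` and `x_norm_2` corresponds to what BatchNorm actually computes?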