I've read that batch normalization eliminates the need for a bias vector in neural networks, since it introduces a shift parameter that functions similarly to a bias. As far as I'm aware, though, a bias term works on a per-node level, whereas the shift parameter of batch normalization is applied to all the activations at once. For instance, in convolutional neural networks the bias vector can be seen as an extra input to each receptive field whose value is always 1. This effectively shifts each individual activation, as opposed to shifting all activations at once.
My question, then, is: is it true that batch normalization eliminates the need for a bias vector?
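For concreteness, here is a minimal PyTorch sketch of the configuration I'm asking about (the channel counts, kernel size, and input shape are arbitrary placeholders of mine, not from any particular paper):

```python
import torch
import torch.nn as nn

# Convolution with its own bias disabled, followed by batch normalization.
# The sizes (3 -> 16 channels, 3x3 kernel) are made-up placeholders.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, bias=False)
bn = nn.BatchNorm2d(num_features=16)

x = torch.randn(8, 3, 32, 32)  # dummy batch of 8 RGB images, 32x32
y = bn(conv(x))                # is dropping the conv bias here justified?
```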
