
Say I have some deep learning model architecture, as well as a chosen mini-batch size. How do I derive from these the expected memory requirements for training that model?

As an example, consider a (non-recurrent) model with input of dimension 1000, 4 fully-connected hidden layers of dimension 100, and an additional output layer of dimension 10. The mini-batch size is 256 examples. How does one determine the approximate memory (RAM) footprint of the training process on the CPU and on the GPU? If it makes any difference, let's assume the model is trained on a GPU with TensorFlow (thus using cuDNN).

Whaa

2 Answers


@ik_vision's answer describes how to estimate the memory needed to store the weights, but you also need to store the intermediate activations, and especially for convolutional networks working with 3D data, these activations are the main part of the memory needed.

To analyze your example:

  1. The input needs 1000 elements
  2. After each of layers 1-4 you have 100 elements, 400 in total
  3. After the final layer you have 10 elements

In total, for 1 sample you need 1410 elements for the forward pass. For everything except the input you also need gradient information for the backward pass; that is 410 more, totaling 1820 elements per sample. Multiply by the batch size to get 465,920.

I said "elements", because the size required per element depends on the data type used. For single precision float32 it is 4B and the total memory needed to store the data blobs will be around 1.8MB.

Jan Kukacka

I see two options:

  1. The network is loaded from disk
  2. The network is created on the fly

In both cases, the memory needed on the GPU must be multiplied by the batch size, as most of the network's data is copied for each sample.

Rule of thumb if loaded from disk: if the DNN takes X MB on disk, the network will take 2X MB of GPU memory for batch size 1.

If the network is created on the fly, for batch size 1: count the parameters and multiply by 4 bytes (float32). Counting the number of parameters manually:

  • fc1: 1000×100 (weights) + 100 (biases)
  • fc2: 100×100 (weights) + 100 (biases)
  • fc3: 100×100 (weights) + 100 (biases)
  • fc4: 100×100 (weights) + 100 (biases)
  • output: 100×10 (weights) + 10 (biases)
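A minimal Python sketch of that tally, using the layer shapes listed above:

```python
# (in_features, out_features) for each fully-connected layer in the example
layers = [(1000, 100), (100, 100), (100, 100), (100, 100), (100, 10)]

# Each layer contributes in*out weights plus out biases
n_params = sum(n_in * n_out + n_out for n_in, n_out in layers)
print(n_params)              # 131410 parameters
print(n_params * 4 / 2**20)  # ≈ 0.5 MiB of weights at 4 bytes each
```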

Counting the number of parameters using Keras: model.count_params()
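For example, a sketch with tf.keras (the choice of ReLU activations is an assumption on my part; it does not change the parameter count):

```python
import tensorflow as tf

# The example architecture: 1000-d input, four 100-d hidden layers, 10-d output
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1000,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10),
])

print(model.count_params())  # 131410
```

ik_vision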

  • As far as I can tell, this gives the memory requirements for storing the weights themselves, but ignores any memory dedicated to things required strictly for training, such as the gradients. Storing the gradients is required, say, for implementing momentum. Am I missing something? – Whaa Jul 21 '16 at 14:02
  • @Whaa This is correct: for normal training you need memory to store the weights, the activations from the forward pass, and the gradients in the back-propagation pass (3x the memory even without momentum). – mjul Dec 12 '17 at 11:36
  • @mjul My experiments show 4.5x ik_vision's estimate. I understand the rationale behind the 3x, but I'm not sure why in practice it's using 4.5x. There must be other Keras/TF overhead? – Wes Apr 12 '18 at 20:05