
I am trying to train a small MLP in PyTorch. Here is the code for the net:

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        self.ln_1 = nn.Linear(600, 512)
        self.ln_2 = nn.Linear(512, 256)
        self.ln_3 = nn.Linear(256, 128)
        self.ln_4 = nn.Linear(128, 64)
        self.ln_5 = nn.Linear(64, 64)
        self.ln_6 = nn.Linear(64, 32)
        self.ln_7 = nn.Linear(32, 16)
        self.ln_8 = nn.Linear(16, 1)
        self.sig = nn.Sigmoid()
        self.relu = nn.ReLU()

    def forward(self, x):
        print(f"Input: {x}")
        x = F.sigmoid(self.ln_1(x))
        print(f"After Layer 1:{x}")
        x = F.sigmoid(self.ln_2(x))
        print(f"After Layer 2:{x}")
        x = F.sigmoid(self.ln_3(x))
        x = F.sigmoid(self.ln_4(x))
        x = F.sigmoid(self.ln_5(x))
        x = F.sigmoid(self.ln_6(x))
        x = F.sigmoid(self.ln_7(x))
        print(f"After Layer 7:{x}")
        x = F.sigmoid(self.ln_8(x))
        print(f"After Layer 8 (out):{x}")
        output = x

        return output

The input (dim = 12x600) looks like this:

tensor([[[-0.0013, -0.0038, -0.0044,  ...,  0.0002,  0.0128,  0.0198],
     [-0.0043, -0.0026, -0.0003,  ..., -0.0002,  0.0038,  0.0057],
     [ 0.0364,  0.0272,  0.0145,  ...,  0.0054,  0.0084,  0.0114],
     ...,
     [-0.0179, -0.0068,  0.0040,  ..., -0.0222, -0.0262, -0.0192],
     [-0.0059, -0.0049, -0.0024,  ..., -0.0403, -0.0379, -0.0358],
     [ 0.0007,  0.0017,  0.0024,  ..., -0.0024, -0.0036, -0.0040]]])

And the output (dim = 1x12) looks like this:

    tensor([[[0.0051],
         [0.0051],
         [0.0051],
         [0.0051],
         [0.0051],
         [0.0051],
         [0.0051],
         [0.0051],
         [0.0051],
         [0.0051],
         [0.0051],
         [0.0051]]], grad_fn=<SigmoidBackward0>)

The optimizer looks like this and I tried different lr values to no avail.

criterion = nn.MSELoss()
optimizer = optim.Adam(Net.parameters(), lr=0.0001)

The training loop is designed the following way:

for epoch in range(epochs):
    running_loss = 0.0
    inp = inputs
    feat = features

    # zero the parameter gradients
    optimizer.zero_grad()

    # forward + backward + optimize
    outputs = fin(inp)
    loss = criterion(outputs.float(), feat)

    loss.backward()
    optimizer.step()

I want to know why, despite different values being input to the model, I end up with the same output for all 12 inputs. Thank you.

Below is an example of the layer outputs at different points:

Input: tensor([[-0.0036, -0.0060, -0.0065,  ...,  0.0006, -0.0021, -0.0043],
        [ 0.0061,  0.0047,  0.0054,  ...,  0.0002, -0.0016, -0.0028],
        [ 0.0012,  0.0028,  0.0037,  ..., -0.0062, -0.0059, -0.0067],
        ...,
        [-0.0002,  0.0010,  0.0005,  ..., -0.0011, -0.0011,  0.0002],
        [-0.0003, -0.0012, -0.0010,  ..., -0.0022, -0.0002,  0.0020],
        [ 0.0005, -0.0013, -0.0027,  ...,  0.0037,  0.0047,  0.0045]])
After Layer 1:tensor([[0.4986, 0.4950, 0.5072,  ..., 0.4992, 0.5060, 0.4963],
        [0.4986, 0.4958, 0.5067,  ..., 0.4999, 0.5050, 0.4980],
        [0.4995, 0.4963, 0.5060,  ..., 0.4994, 0.5058, 0.4973],
        ...,
        [0.4983, 0.4964, 0.5053,  ..., 0.4992, 0.5055, 0.4976],
        [0.4982, 0.4944, 0.5061,  ..., 0.4989, 0.5048, 0.4972],
        [0.4983, 0.4949, 0.5056,  ..., 0.4997, 0.5058, 0.4970]],
       grad_fn=<SigmoidBackward0>)
After Layer 2:tensor([[0.4716, 0.3847, 0.3724,  ..., 0.5870, 0.5455, 0.4221],
        [0.4716, 0.3848, 0.3725,  ..., 0.5867, 0.5460, 0.4221],
        [0.4715, 0.3848, 0.3724,  ..., 0.5868, 0.5460, 0.4222],
        ...,
        [0.4716, 0.3849, 0.3724,  ..., 0.5871, 0.5461, 0.4220],
        [0.4716, 0.3848, 0.3724,  ..., 0.5869, 0.5461, 0.4221],
        [0.4717, 0.3848, 0.3725,  ..., 0.5869, 0.5462, 0.4222]],
       grad_fn=<SigmoidBackward0>)
After Layer 7:tensor([[0.4533, 0.4607, 0.5720, 0.4933, 0.4806, 0.5116, 0.6091, 0.4371, 0.5633,
         0.5099, 0.6671, 0.6205, 0.4186, 0.4554, 0.6118, 0.4706],
        [0.4533, 0.4607, 0.5720, 0.4933, 0.4806, 0.5116, 0.6091, 0.4371, 0.5633,
         0.5099, 0.6671, 0.6205, 0.4186, 0.4554, 0.6118, 0.4706],
        [0.4533, 0.4607, 0.5720, 0.4933, 0.4806, 0.5116, 0.6091, 0.4371, 0.5633,
         0.5099, 0.6671, 0.6205, 0.4186, 0.4554, 0.6118, 0.4706],
        [0.4533, 0.4607, 0.5720, 0.4933, 0.4806, 0.5116, 0.6091, 0.4371, 0.5633,
         0.5099, 0.6671, 0.6205, 0.4186, 0.4554, 0.6118, 0.4706],
        [0.4533, 0.4607, 0.5720, 0.4933, 0.4806, 0.5116, 0.6091, 0.4371, 0.5633,
         0.5099, 0.6671, 0.6205, 0.4186, 0.4554, 0.6118, 0.4706],
        [0.4533, 0.4607, 0.5720, 0.4933, 0.4806, 0.5116, 0.6091, 0.4371, 0.5633,
         0.5099, 0.6671, 0.6205, 0.4186, 0.4554, 0.6118, 0.4706],
        [0.4533, 0.4607, 0.5720, 0.4933, 0.4806, 0.5116, 0.6091, 0.4371, 0.5633,
         0.5099, 0.6671, 0.6205, 0.4186, 0.4554, 0.6118, 0.4706],
        [0.4533, 0.4607, 0.5720, 0.4933, 0.4806, 0.5116, 0.6091, 0.4371, 0.5633,
         0.5099, 0.6671, 0.6205, 0.4186, 0.4554, 0.6118, 0.4706],
        [0.4533, 0.4607, 0.5720, 0.4933, 0.4806, 0.5116, 0.6091, 0.4371, 0.5633,
         0.5099, 0.6671, 0.6205, 0.4186, 0.4554, 0.6118, 0.4706],
        [0.4533, 0.4607, 0.5720, 0.4933, 0.4806, 0.5116, 0.6091, 0.4371, 0.5633,
         0.5099, 0.6671, 0.6205, 0.4186, 0.4554, 0.6118, 0.4706],
        [0.4533, 0.4607, 0.5720, 0.4933, 0.4806, 0.5116, 0.6091, 0.4371, 0.5633,
         0.5099, 0.6671, 0.6205, 0.4186, 0.4554, 0.6118, 0.4706],
        [0.4533, 0.4607, 0.5720, 0.4933, 0.4806, 0.5116, 0.6091, 0.4371, 0.5633,
         0.5099, 0.6671, 0.6205, 0.4186, 0.4554, 0.6118, 0.4706]],
       grad_fn=<SigmoidBackward0>)
After Layer 8 (out):tensor([[0.5051],
        [0.5051],
        [0.5051],
        [0.5051],
        [0.5051],
        [0.5051],
        [0.5051],
        [0.5051],
        [0.5051],
        [0.5051],
        [0.5051],
        [0.5051]], grad_fn=<SigmoidBackward0>)


sage
  • You can check each layer's outputs. – gunes Jul 30 '22 at 11:28
  • So I also tried inputting without a batch and got the same result for all 12 inputs. – sage Jul 31 '22 at 06:00
  • I meant: are the outputs the same after the first layer, second layer, etc.? – gunes Jul 31 '22 at 06:56
  • Can you also show how you are passing the 12x600 tensor to the network, and the complete training loop if possible? – Ganesh Tata Jul 31 '22 at 18:47
  • I just updated the question and added what you requested. It seems that after a few layers the inputs in the batch become the same. @gunes – sage Aug 15 '22 at 12:41
  • I replaced all of the sigmoids aside from the last one with leaky ReLUs and now I am getting different outputs (see the sketch below). – sage Aug 15 '22 at 13:21
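For reference, a minimal sketch of the change described in that last comment: LeakyReLU on the hidden layers and sigmoid only on the output. The layer sizes come from the question; the class name, the negative_slope value, and the quick shape check at the end are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NetLeaky(nn.Module):
    def __init__(self):
        super().__init__()
        # same layer sizes as ln_1 ... ln_8 in the question
        sizes = [600, 512, 256, 128, 64, 64, 32, 16, 1]
        self.layers = nn.ModuleList(
            nn.Linear(n_in, n_out) for n_in, n_out in zip(sizes[:-1], sizes[1:])
        )

    def forward(self, x):
        # LeakyReLU keeps a small gradient for negative pre-activations,
        # so the hidden activations do not all collapse toward 0.5
        for layer in self.layers[:-1]:
            x = F.leaky_relu(layer(x), negative_slope=0.01)
        # sigmoid only on the final layer, so the output stays in (0, 1)
        return torch.sigmoid(self.layers[-1](x))

# quick check with a batch shaped like the one in the question (12 x 600)
net = NetLeaky()
out = net(torch.randn(12, 600) * 0.01)   # small-magnitude inputs, as in the question
print(out.shape)                         # torch.Size([12, 1])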

1 Answer


I ran into the same issue. I tried to fine-tune a large language model with millions of parameters, but it output exactly the same values for every batch. Finally, I figured out that I had used too large a learning rate (0.0001) for the Adam optimizer.

Normally we use a tiny learning rate during the fine-tuning stage, but I had forgotten to adjust it. In your case, even though you are not using a pretrained model, the learning rate is still too large.

As you stated, changing the sigmoids to (leaky) ReLUs gives you different outputs, and I agree that sigmoid is not the best choice of activation function here. However, if you look at how sigmoid and ReLU differ, you will find that ReLU and LeakyReLU have little or no gradient when the input is below zero. That is why, even with a large learning rate and a small batch size, the training process stays stable: the design of ReLU helps guard against this kind of training instability. But if you cannot alter the architecture and need to keep a particular design, the practical fix is to turn down the learning rate.
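A minimal sketch of that suggestion, reusing the Net class from the question: build the optimizer from a model instance and drop the Adam learning rate. The learning-rate value, the variable names, and the dummy data below are assumptions, only there to make the snippet self-contained.

import torch
import torch.nn as nn
import torch.optim as optim

net = Net()                                        # the Net class defined in the question
criterion = nn.MSELoss()
# note: parameters() must be called on an instance, not on the Net class itself
optimizer = optim.Adam(net.parameters(), lr=1e-5)  # 10x smaller than the 1e-4 in the question

inputs = torch.randn(12, 600) * 0.01               # stand-in for the real 12x600 batch
features = torch.rand(12, 1)                       # stand-in targets in (0, 1)

for epoch in range(100):
    optimizer.zero_grad()
    outputs = net(inputs)
    loss = criterion(outputs, features)
    loss.backward()
    optimizer.step()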

Fang WU