I was wondering if someone could explain why, if I apply softmax to
[683, 861, 981, 834]
I get
[3.80403403e-130 7.66764807e-053 1.00000000e+000 1.44115655e-064]
But if I take a factor of 100 out:
[6.832, 8.61, 9.81, 8.34]
then I get this:
[0.03217071 0.19038654 0.63210557 0.14533718]
Which is more in line with what I'd expect. Clearly I don't understand softmax; could someone explain what's going on?
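For what it's worth, the tiny numbers in the first result seem to match the raw exponent differences; a quick check with numpy (assuming the usual max-subtracted formulation of softmax):

import numpy as np

x = np.array([683.0, 861.0, 981.0, 834.0])
# softmax subtracts the row max, so each output is roughly e^(x_i - 981)
print(np.exp(x - x.max()))
# -> [3.80403403e-130 7.66764807e-053 1.00000000e+000 1.44115655e-064]
# the denominator is ~1 here because the largest term dominates the sum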
I'm using the output of the softmax as probabilities for selecting actions in a neural network, but because the largest output is effectively equal to 1, the largest action is always selected and the others have essentially zero chance of being picked.
Perhaps I should use a function that sums the entries and bases probabilities on the proportion each entry makes up of the sum...?
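Something like this rough sketch is what I have in mind (proportional_probs is a hypothetical helper, not tested, and only makes sense for non-negative entries):

import numpy as np

def proportional_probs(x):
    # each entry's share of the total, instead of exponentiating
    x = np.asarray(x, dtype=float)
    return x / x.sum()

print(proportional_probs([683, 861, 981, 834]))
# -> roughly [0.203, 0.256, 0.292, 0.248]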
I'm using the softmax function from sklearn (scikit-learn), a Python machine-learning library:
import numpy as np

def softmax(X, copy=True):
    # X is expected to be a 2D float array: one row per sample
    if copy:
        X = np.copy(X)
    # subtract the per-row max for numerical stability before exponentiating
    max_prob = np.max(X, axis=1).reshape((-1, 1))
    X -= max_prob
    np.exp(X, X)  # exponentiate in place
    # normalise each row so it sums to 1
    sum_prob = np.sum(X, axis=1).reshape((-1, 1))
    X /= sum_prob
    return X
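For completeness, this is roughly how I'm calling it; the input has to be a 2D row because the function reduces along axis=1:

print(softmax(np.array([[683.0, 861.0, 981.0, 834.0]])))
# -> [[3.80403403e-130 7.66764807e-053 1.00000000e+000 1.44115655e-064]]

print(softmax(np.array([[6.832, 8.61, 9.81, 8.34]])))
# -> [[0.03217071 0.19038654 0.63210557 0.14533718]]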