I am new to the field of machine learning, so this question may sound silly. We usually use a sigmoid in the output layer for binary classification. In my experiments, I found that replacing the sigmoid with $\tanh$ in the output layer gives higher accuracy and lower binary cross-entropy loss. Can someone please explain the possible reason? I am using $0$ and $1$ as labels.
The code is shown below. I am using Keras with TensorFlow as the backend.
from keras.layers import Input, Dense, Activation, Dropout, Dot
from keras.models import Model, Sequential

input_shape = (200, )
left_input = Input(input_shape)
right_input = Input(input_shape)

# Shared encoder: two 200-unit tanh layers with dropout
model = Sequential()
model.add(Dense(200, input_dim=200, kernel_initializer='glorot_uniform', bias_initializer='zeros'))
model.add(Activation('tanh'))
model.add(Dropout(0.1))
model.add(Dense(200, kernel_initializer='glorot_uniform', bias_initializer='zeros'))
model.add(Activation('tanh'))
model.add(Dropout(0.1))

# The same encoder (shared weights) is applied to both inputs
x1 = model(left_input)
x2 = model(right_input)

# Cosine similarity between the two encodings (normalize=True), then the output layer
dotted = Dot(axes=1, normalize=True)([x1, x2])
out = Dense(1, activation='sigmoid', kernel_initializer='glorot_uniform', bias_initializer='zeros')(dotted)

siamese = Model(inputs=[left_input, right_input], outputs=out)
siamese.compile(loss='binary_crossentropy', optimizer='Adagrad', metrics=['accuracy'])
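For clarity, the $\tanh$ variant I am comparing against keeps everything else the same and only swaps the activation of the output layer:

# tanh variant: identical model, only the output activation changes
out = Dense(1, activation='tanh', kernel_initializer='glorot_uniform', bias_initializer='zeros')(dotted)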
$$ f(x) = \frac{\exp(x)}{1 + \exp(x)}, \qquad \tanh(x) = 2f(2x) - 1 $$
However, $\tanh(x)\in[-1,1]$, so it's not clear how you're computing the cross-entropy loss. Cross-entropy loss uses logarithms of probabilities, and logarithms of negative numbers are not real. Are you sure this isn't just an artifact of clipping $\tanh(x)$ when it's non-positive? – Sycorax Feb 21 '18 at 14:44

out = Dense(1, activation='sigmoid', kernel_initializer='glorot_uniform', bias_initializer='zeros')(dotted)
You're not using $\tanh$ here; you're using a sigmoid activation. – Sycorax Feb 22 '18 at 14:36
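To make the comment above concrete, here is a minimal NumPy sketch. It checks the identity relating $\tanh$ to the sigmoid $f$, and shows that a binary cross-entropy which clips predictions into $[\epsilon, 1-\epsilon]$ before taking logs stays finite even when the "probabilities" come from $\tanh$ and are negative. The clipping step and the $10^{-7}$ epsilon are assumptions mirroring typical backend behaviour, not something taken from the thread itself.

import numpy as np

def f(x):
    # the logistic sigmoid from the comment above
    return np.exp(x) / (1.0 + np.exp(x))

x = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(x), 2 * f(2 * x) - 1))   # True: tanh is a rescaled, shifted sigmoid

def bce_clipped(y_true, y_pred, eps=1e-7):
    # Clip predictions into [eps, 1 - eps] before taking logs (assumed backend-style
    # safeguard): negative tanh outputs are silently mapped to eps instead of
    # producing the log of a negative number.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([0.0, 1.0, 1.0, 0.0])
tanh_out = np.array([-0.8, 0.9, 0.2, -0.1])        # tanh outputs can be negative
print(bce_clipped(y_true, tanh_out))               # finite loss only because of the clipping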