I am new to the field of machine learning, so this question may sound silly. We usually use a sigmoid in the output layer for binary classification. In my experiments, I found that replacing the sigmoid with $\tanh$ in the output layer gives higher accuracy and lower binary cross-entropy loss. Can someone please explain the possible reason? I am using $0$ and $1$ as labels.
The code is shown below. I am using Keras with TensorFlow as the backend.
from keras.layers import Input, Dense, Activation, Dropout, Dot
from keras.models import Model, Sequential

input_shape = (200, )
left_input = Input(input_shape)
right_input = Input(input_shape)

# Shared encoder: two 200-unit tanh layers with dropout
model = Sequential()
model.add(Dense(200, input_dim=200, kernel_initializer='glorot_uniform', bias_initializer='zeros'))
model.add(Activation('tanh'))
model.add(Dropout(0.1))
model.add(Dense(200, kernel_initializer='glorot_uniform', bias_initializer='zeros'))
model.add(Activation('tanh'))
model.add(Dropout(0.1))

# The same encoder (shared weights) is applied to both inputs
x1 = model(left_input)
x2 = model(right_input)

# Cosine similarity between the two encodings (normalize=True), then the output layer
dotted = Dot(axes=1, normalize=True)([x1, x2])
out = Dense(1, activation='sigmoid', kernel_initializer='glorot_uniform', bias_initializer='zeros')(dotted)

siamese = Model(inputs=[left_input, right_input], outputs=out)
siamese.compile(loss='binary_crossentropy', optimizer='Adagrad', metrics=['accuracy'])
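For clarity, the $\tanh$ variant I am comparing against keeps everything else the same and only swaps the activation of the output layer:

# tanh variant: identical model, only the output activation changes
out = Dense(1, activation='tanh', kernel_initializer='glorot_uniform', bias_initializer='zeros')(dotted)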
$$ f(x) = \frac{\exp(x)}{1 + \exp(x)}, \qquad \tanh(x) = 2f(2x) - 1 $$
However, $\tanh(x)\in[-1,1]$, so it's not clear how you're computing the cross-entropy loss. Cross-entropy loss uses logarithms of probabilities, and logarithms of negative numbers are not real. Are you sure this isn't just an artifact of clipping $\tanh(x)$ when it's non-positive? – Sycorax Feb 21 '18 at 14:44

out = Dense(1, activation='sigmoid', kernel_initializer='glorot_uniform', bias_initializer='zeros')(dotted)
You're not using $\tanh$ here; you're using a sigmoid activation. – Sycorax Feb 22 '18 at 14:36
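To make the comment above concrete, here is a minimal NumPy sketch. It checks the identity relating $\tanh$ to the sigmoid $f$, and shows that a binary cross-entropy which clips predictions into $[\epsilon, 1-\epsilon]$ before taking logs stays finite even when the "probabilities" come from $\tanh$ and are negative. The clipping step and the $10^{-7}$ epsilon are assumptions mirroring typical backend behaviour, not something taken from the thread itself.

import numpy as np

def f(x):
    # the logistic sigmoid from the comment above
    return np.exp(x) / (1.0 + np.exp(x))

x = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(x), 2 * f(2 * x) - 1))   # True: tanh is a rescaled, shifted sigmoid

def bce_clipped(y_true, y_pred, eps=1e-7):
    # Clip predictions into [eps, 1 - eps] before taking logs (assumed backend-style
    # safeguard): negative tanh outputs are silently mapped to eps instead of
    # producing the log of a negative number.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([0.0, 1.0, 1.0, 0.0])
tanh_out = np.array([-0.8, 0.9, 0.2, -0.1])        # tanh outputs can be negative
print(bce_clipped(y_true, tanh_out))               # finite loss only because of the clipping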