
For example, I have the following feature set:

{a:1, b:1} -> 1
{a:2, b:2} -> 1
{a:3, b:3} -> 1

{a:1, b:2} -> 0
{a:4, b:6} -> 0
{a:3, b:5} -> 0

And some more data points like these. I don't provide features such as a == b or a != b.

Can a neural network learn the following logic?

(a == b)  ->  1
(a != b)  ->  0

Assume I give inputs like

{a:123, b:123}
{a:666, b:668}

The model has never seen inputs such as 123 or 666 before.

  • Is this really what you want to do, or is this an XY problem? – Firebug Jan 30 '23 at 15:14
  • 4
    The neural network might find the logic you want, or another logic, especially if the training set is not diversified enough. For instance, from your training set, notice that if a or b is strictly greater than 3, then the answer is always 0. Your neural network might notice that and use that as a criterion. – Stef Jan 30 '23 at 16:34
  • 4
    Surely you are missing a necessary example for a ≠ b where a > b. In your feature set, all the cases fall into a = b or a < b. – Benjamin R Jan 30 '23 at 20:00
  • @BenjaminR Good catch! – WindChaser Jan 31 '23 at 06:27
  • 1
    probably a siamese neural network can do the trick. As they learn similarities instead of labels, you do not need all possible a==b and a!=b samples. Just a very few would do the trick. Siamese networks are usually good for open set problems (unseen data), which is exactly what you need. – mad Jan 31 '23 at 15:16
  • My main curiosity is whether neural networks can learn inter-feature relationships from limited input data, without any human intervention (e.g., defining something like "a-b" or "a/b"). – WindChaser Feb 01 '23 at 07:12
  • 1
    Neural networks are local function approximators. Without inductive biases they are unlikely to learn such a function, that works on the whole real line. – Firebug Feb 01 '23 at 17:31
  • @Firebug Do you mean it's easy to learn "a>b" or "a<b", but hard to learn "a==b"? – WindChaser Feb 02 '23 at 18:18
  • Neither example is "easy" for neural networks without inductive biases – Firebug Feb 03 '23 at 08:25

3 Answers


The absolute value function can be written as

$$|a-b|=\text{ReLU}(a-b) + \text{ReLU}(b-a),$$
and it attains its minimum of 0 at $a = b$. We can compose this with a sigmoid layer

$$\sigma\left(\delta\left(\text{ReLU}(a-b) + \text{ReLU}(b-a)+\epsilon \right) \right),$$
which is very close to what is desired for $\epsilon < 0$: the $\epsilon$ shifts the minimum below 0, and choosing $\delta < 0$ means that a negative argument maps to a number greater than 0.5 and a positive argument maps to a number less than 0.5.

Naturally, there will be "wrong answers" when $|a-b|$ is close to $|\epsilon|$. This is unavoidable with continuous functions (such as those used in neural networks). Changing the magnitude of $\epsilon$ controls this.

A deficiency of this construction is that its outputs are never exactly 0 or exactly 1. You can't obtain those values exactly because the sigmoid function only attains 0 and 1 in the limit of infinitely large or small inputs. It's probably also hard to train a neural network to find weights that work well, especially for representing $|a-b|$.
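
To make this concrete, here is a minimal NumPy sketch of this hand-built network; the particular values $\delta = -10$ and $\epsilon = -0.5$ are illustrative choices of mine, not the only ones that work.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def equality_net(a, b, delta=-10.0, eps=-0.5):
    # sigma(delta * (ReLU(a - b) + ReLU(b - a) + eps))
    return sigmoid(delta * (relu(a - b) + relu(b - a) + eps))

print(equality_net(123, 123))  # close to 1
print(equality_net(666, 668))  # close to 0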

Sycorax

To supplement Sycorax's answer on how a neural network might represent the function, I thought I'd see whether a simple network can learn that representation. The target network has two hidden neurons with ReLU activation and an output neuron with sigmoid activation.

Notebook

Here's my setup:

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
import numpy as np

n = 1000

np.random.seed(314)
x1 = np.random.randint(-100, 101, size=n)
p = np.random.poisson(size=n)  # nonnegative offset, so x2 >= x1
x2 = x1 + p
X = np.vstack((x1, x2)).T
X = X / 100.0  # scale inputs to roughly [-1, 1]

y = (p == 0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

I cannot coax scikit-learn's MLPClassifier into learning the two-neuron structure. Perhaps by trying loads of initial states I could get something close enough that the learning process would settle into the desired state, but with just a handful of attempts I couldn't manage it.

Expanding to 100 hidden neurons, a little fiddling with the other hyperparameters gives perfect accuracy on an i.i.d. test set; but with that many neurons the model seems to be overfitting to the training range, because it fails on an out-of-range test set (x1 drawn from 200 to 300, the rest as above).

Fiddling by hand with the hyperparameters some more, I'm able to get a good-looking network with 5 hidden neurons:

model = MLPClassifier(
    (5,),
    learning_rate_init=0.05,
    learning_rate="adaptive",
    alpha=0,
    max_iter=1000,
    random_state=0,
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
#> 0.964 
print(model.coefs_)
#> 
[array([[-5.28193302, -4.71679774, -0.20732829, -0.82536738, -0.14136384],
        [ 5.26312562,  4.70770845,  0.31451317,  1.42101998, -0.21582437]]),
 array([[-16.27265811],
        [-18.1835566 ],
        [  0.1559244 ],
        [ -0.38534808],
        [  0.7400243 ]])]

You can see that the first and second neurons are finding the right idea, while the last three are a bit off; and the output neuron is starting to ignore those three in favor of the first two (with large negative coefficients, the $\delta$ of Sycorax's formula). More data would probably strengthen the correct relationship, but this already performs well on the out-of-range test data.
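
For reference, here is a sketch of what that out-of-range check might look like, continuing the setup above; the exact code isn't shown here, so treat the details as my assumptions.

# Reconstruction of the out-of-range check: same generating process,
# but with x1 shifted to the 200-300 range.
x1_far = np.random.randint(200, 301, size=n)
p_far = np.random.poisson(size=n)
x2_far = x1_far + p_far
X_far = np.vstack((x1_far, x2_far)).T / 100.0
y_far = (p_far == 0)
print(model.score(X_far, y_far))  # the 5-neuron model above still scores well here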


Oh, but because the offset above is a nonnegative Poisson draw, we always have $x_2\ge x_1$, which explains why the two important neurons both fire on something like $x_2-x_1$ rather than one of them being $x_1-x_2$. After multiplying p by a random sign, I have a much harder time getting MLPClassifier to train a good model. By switching from ReLU to tanh I can, and in fact I manage with just a 2-neuron layer:

model = MLPClassifier(
    (2,),
    activation='tanh',
    solver='lbfgs',
    max_iter=1000,
    random_state=0,
)
model.fit(X_train, y_train)
print(model.coefs_)
#> 
[array([[-155.04975387,   62.57832368],
        [ 155.0491934 ,  -62.57812308]]), 
 array([[ 75.66146126],
        [168.25414012]])]
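
For completeness, the sign-flipped data that the tanh model above was trained on might be generated like this; the exact calls are my reconstruction of the step described in the text, not necessarily the original notebook's code.

# My guess at the "multiply p by a random sign" step described above;
# x2 can now fall on either side of x1, making the problem symmetric.
sign = np.random.choice([-1, 1], size=n)
p = np.random.poisson(size=n) * sign
x2 = x1 + p
X = np.vstack((x1, x2)).T / 100.0
y = (p == 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)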
Ben Reiniger
  • oops, always x2>=x1, which is why the two good neurons in the end are both (-, +) instead of one being reversed. Lemme fix that... – Ben Reiniger Jan 30 '23 at 16:51
  • unfortunately, the fix has made it rather more difficult to get the network to learn the correct rule! – Ben Reiniger Jan 30 '23 at 17:17

Your function can be represented as: $$f(x,y) = \lim_{n\to+\infty}(\sigma(1/|x-y|))^n$$

A good approximation can be obtained with a large $n$. A neural network can further approximate that function. In fact, depending on what you consider a neural network to be, the function itself is already a neural network with special activation functions.
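
As a quick numerical illustration (my own, taking $n = 100$ as the "large $n$" and treating the $x = y$ case as its limiting value of 1):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f_approx(x, y, n=100):
    # (sigma(1 / |x - y|))^n, with x == y taken as the limit value 1
    if x == y:
        return 1.0
    return sigmoid(1.0 / abs(x - y)) ** n

print(f_approx(123, 123))  # 1.0
print(f_approx(666, 668))  # essentially 0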

Firebug