To supplement Sycorax's answer on how a neural network might represent the function, I thought I'd see whether a simple network can learn that representation. The target network has two hidden neurons with ReLU activation and an output neuron with sigmoid activation.
Notebook
Here's my setup:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
import numpy as np

n = 1000
np.random.seed(314)
x1 = np.random.randint(-100, 101, size=n)   # integers in [-100, 100]
p = np.random.poisson(size=n)               # non-negative noise, rate 1
x2 = x1 + p                                 # x1 == x2 exactly when p == 0
X = np.vstack((x1, x2)).T
X = X / 100.0                               # scale features to roughly [-1, 1]
y = (p == 0)                                # label: are the two inputs equal?
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
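As a sanity check on the labels (an aside, not part of the setup above): with numpy's default Poisson rate of 1, $P(p = 0) = e^{-1} \approx 0.37$, so always guessing "not equal" already scores about 0.63, which is worth keeping in mind when reading the accuracies below.
# Class balance (an aside): with the default Poisson rate of 1,
# P(p == 0) = exp(-1) ≈ 0.368, so a constant "not equal" guess scores ~0.63.
print(y.mean())  # fraction of "equal" examples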
I could not coax scikit-learn's MLPClassifier into learning the two-neuron structure directly. Perhaps by trying lots of initial states I could start close enough that training would settle into the desired solution, but with just a handful of attempts I didn't manage it.
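By "trying lots of initial states" I just mean looping over random seeds for a two-neuron ReLU network, something like this (a sketch of the idea, not the exact loop I ran):
# Try a handful of random initializations of a 2-neuron ReLU network.
for seed in range(10):
    m = MLPClassifier(hidden_layer_sizes=(2,), max_iter=1000, random_state=seed)
    m.fit(X_train, y_train)
    print(seed, m.score(X_test, y_test))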
Expanding to 100 hidden neurons, a little fiddling with the other hyperparameters gives perfect accuracy on an i.i.d. test set; but with that many neurons it seems to be overfitting to the training distribution, because it fails on an out-of-range test set (x1 drawn from 200 to 300, everything else as above).
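Concretely, that out-of-range test set uses the same recipe (the variable names here are mine):
# Out-of-range test set: same generating process, but x1 in [200, 300].
x1_far = np.random.randint(200, 301, size=n)
p_far = np.random.poisson(size=n)
x2_far = x1_far + p_far
X_far = np.vstack((x1_far, x2_far)).T / 100.0
y_far = (p_far == 0)
# model.score(X_far, y_far) then gives the out-of-range accuracy mentioned below.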
Fiddling by hand with the hyperparameters some more, I'm able to get a good-looking network with 5 hidden neurons:
model = MLPClassifier(
    hidden_layer_sizes=(5,),
    learning_rate_init=0.05,
    learning_rate="adaptive",
    alpha=0,
    max_iter=1000,
    random_state=0,
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
#> 0.964
print(model.coefs_)
#>
[array([[-5.28193302, -4.71679774, -0.20732829, -0.82536738, -0.14136384],
[ 5.26312562, 4.70770845, 0.31451317, 1.42101998, -0.21582437]]),
array([[-16.27265811],
[-18.1835566 ],
[ 0.1559244 ],
[ -0.38534808],
[ 0.7400243 ]])]
You can see that the first and second neurons have found the right idea, while the last three are somewhat off, and that the output neuron is learning to ignore those three in favor of the first two (via the large negative coefficients, the $\delta$ of Sycorax's formula). More data would probably strengthen the correct relationship further, but this model already performs well on the out-of-range test data.
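Reading the printed weights off directly (rounding, and ignoring the biases, which are not shown here), the first hidden unit is roughly
$$h_1 \approx \operatorname{ReLU}(-5.28\,x_1 + 5.26\,x_2) \approx \operatorname{ReLU}\bigl(5.27\,(x_2 - x_1)\bigr),$$
and the second is similarly $h_2 \approx \operatorname{ReLU}\bigl(4.71\,(x_2 - x_1)\bigr)$. Both fire whenever $x_2 > x_1$ and reach the output through weights of about $-16.3$ and $-18.2$, pushing the sigmoid toward "not equal", much like the difference-detecting unit in Sycorax's construction.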
Oh, one caveat: because the gap $x_2 - x_1$ above is a non-negative Poisson variable, we always have $x_2 \ge x_1$, which explains why both of the important neurons fire on something like $x_2 - x_1$ rather than one of them handling $x_1 - x_2$. Multiplying p by a random sign instead (a sketch of that modification follows the coefficients below), I have a much harder time getting MLPClassifier to train a good ReLU model. Switching the activation to tanh works, and in fact a 2-neuron hidden layer is enough:
model = MLPClassifier(
    hidden_layer_sizes=(2,),
    activation='tanh',
    solver='lbfgs',
    max_iter=1000,
    random_state=0,
)
model.fit(X_train, y_train)  # fit on the sign-flipped data (sketched below)
print(model.coefs_)
#>
[array([[-155.04975387, 62.57832368],
[ 155.0491934 , -62.57812308]]),
array([[ 75.66146126],
[168.25414012]])]
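For reference, the sign-flipped data mentioned above can be generated like this (a sketch; only the construction of p changes, and the split is redone on the new data):
# Symmetric variant: give the Poisson term a random sign, so x2 can fall
# on either side of x1. Everything else is as in the original setup.
signs = np.random.choice([-1, 1], size=n)
p = signs * np.random.poisson(size=n)
x2 = x1 + p
X = np.vstack((x1, x2)).T / 100.0
y = (p == 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
Reading the new weights the same way as before, both tanh units are again multiples of $x_2 - x_1$, with opposite signs and very different scales; presumably the biases (not printed here) combine with these two to isolate the narrow band around $x_1 = x_2$.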
a ≠ b where a > b. In your feature set, all the cases fall into a = b or a < b. – Benjamin R Jan 30 '23 at 20:00