
I have a NN that I would like to train to square a number. This is a learning exercise for me.

My input is the number to be squared; the output is its square.

Two questions: 1) How can this possibly work? The weights and nodes of the NN are fixed, yet they need to produce the square of an input that isn't.

2) Assuming I am wrong, what is a strategy for choosing the number of nodes and layers for a NN?

    As an example: https://stats.stackexchange.com/questions/299915/how-does-the-rectified-linear-unit-relu-activation-function-produce-non-linear/299933#299933, but a necessary, unstated component of your question is what precision you want in the result and over what interval; the universal approximation theorem lays out technical criteria for NNs to approximate specific functions. – Sycorax Jun 24 '19 at 18:29

2 Answers


The ReLU activation function should take care of this.

ReLU works by fitting short, straight line segments to approximate curves, and that should be able to create a parabola (see the sketch below). Performance will suffer for inputs with very large absolute values, but we know that models won't be perfect.
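To make the short-straight-lines idea concrete, here is a minimal sketch of my own (the helper name and the hand-chosen weights are mine, not learned by a network): a piecewise-linear approximation of $x^2$ built from nothing but ReLUs, with one knot per integer.

import numpy as np

relu = lambda z: np.maximum(0, z)

def relu_parabola(x, n_knots=10):
    # Hand-built piecewise-linear approximation of x**2 using only ReLUs.
    # The slope of x**2 between integers k and k+1 is 2k+1, so the first
    # segment starts with slope 1 and every later knot adds 2 to the slope.
    out = np.zeros_like(x, dtype=float)
    for k in range(n_knots):
        w = 1.0 if k == 0 else 2.0
        out += w * (relu(x - k) + relu(-x - k))  # mirrored term handles negative x
    return out

xs = np.linspace(-3, 3, 13)
print(np.column_stack([xs, xs**2, relu_parabola(xs)]))
# Exact at the integer knots; the worst error between knots is 0.25.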

I was thinking that one hidden layer could take care of this, but reading about the universal approximation theorem (which I suggest doing) shows that we can be more efficient by spreading fewer nodes across multiple hidden layers rather than packing tons of nodes into one hidden layer.
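As a rough illustration of that depth-versus-width point (the sizes 300 and 16 are illustrative choices of mine), compare the parameter counts of a single wide hidden layer and two narrow ones; a deep ReLU net composes its kinks, so the number of linear pieces it can produce grows much faster with depth than with width.

import torch.nn as nn

# One wide hidden layer vs. two narrow hidden layers (illustrative sizes only)
wide = nn.Sequential(nn.Linear(1, 300), nn.ReLU(), nn.Linear(300, 1))
deep = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(wide))  # 901 parameters, yet at most ~301 linear pieces
print(n_params(deep))  # 321 parameters; composition allows far more pieces in principle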

EDIT

I didn't make this clear three years ago. The universal approximation theorem says that we can approximate on a compact set (on the real line, that means a closed and bounded subset of the number line). Once you go past that bound, all bets are off, which is why I said that performance will suffer for inputs with very large absolute values. For a visualization, imagine how an absolute value function ($\vert x\vert = \operatorname{ReLU}(x) + \operatorname{ReLU}(-x)$) could approximate $y=x^2$ for small numbers, such as on $(-1, 1)$, but the approximation is awful for $x=10$, for instance.
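A tiny numerical sketch of that visualization (added by me for concreteness): $\operatorname{ReLU}(x) + \operatorname{ReLU}(-x)$ is a passable stand-in for $x^2$ on $(-1, 1)$, but hopeless at $x = 10$.

relu = lambda z: max(z, 0.0)
abs_approx = lambda x: relu(x) + relu(-x)  # |x| built from two ReLUs

for x in [-1.0, -0.5, 0.0, 0.5, 1.0, 10.0]:
    print(f"x = {x:5.1f}   x^2 = {x**2:6.2f}   ReLU(x) + ReLU(-x) = {abs_approx(x):6.2f}")
# Off by at most 0.25 on (-1, 1), but off by 90 at x = 10.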

– Dave

This is an interesting question. I wanted to contribute an answer which shows how we can do this practically in Python, and to call out a few interesting things. I hope the interested reader will take the code, modify it, and experiment themselves. I give a few suggestions for things to play around with at the end.

Python Implementation using PyTorch

The code below creates a neural network using PyTorch. I have used the ReLU function between layers (see comment below). I have tried to strike a balance: a network which is simple and easy to train, but which also does a reasonable job (at least on the interval [0,10]; see comments and graph below).

The model is trained on random data from the range [0,10].

Graphs

This graph shows the predicted (blue) and actual (red) values for unseen random input data from the range [-5, 15]; the shaded band marks the training range [0, 10].

[Graph: ReLU, predicting x squared]

  • It is interesting to note how poorly the model performs outside the region on which it is trained.

Things to experiment with

  • Try other activation functions or combinations (like tanh). If I keep everything in the code below identical but change the activation functions to tanh, we get the result shown below (see the sketch after this list for the exact swap).

[Graph: using tanh]

We can improve the performance with more epochs...

[Graph: using tanh, 500 epochs]

I also note here that the function $x^2$ is non-linear, so you could use that as your activation function - but I do not think that is in the spirit of this question :D

  • See what happens if you use less training data or over a bigger range.
  • See what happens if you change the architecture, for example using fewer layers.
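For the first bullet, the swap is exactly this (a sketch only; everything else in the Code section below stays the same):

# Same architecture as in the Code section below, with tanh activations instead of ReLU
model = nn.Sequential(
    nn.Linear(1, 16),
    nn.Tanh(),
    nn.Linear(16, 16),
    nn.Tanh(),
    nn.Linear(16, 1),
)
# For the "500 epochs" graph, also set n_epochs = 500 before training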

Code

import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

# Create training data: 1000 points drawn uniformly from [0, 10], labelled with their squares
X = torch.distributions.uniform.Uniform(0, 10).sample([1000, 1])
y = X**2

# A small fully connected network: 1 input -> two hidden layers of 16 ReLU units -> 1 output
model = nn.Sequential(
    nn.Linear(1, 16),
    nn.ReLU(),
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)

loss_fn   = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
n_epochs = 150
batch_size = 50

for epoch in range(n_epochs):
    for i in range(0, len(X), batch_size):
        Xbatch = X[i:i+batch_size]
        y_pred = model(Xbatch)
        ybatch = y[i:i+batch_size]
        loss = loss_fn(y_pred, ybatch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f'Finished epoch {epoch}, latest loss {loss}')

# Example, can we square 3 - looks ok
print(model(torch.tensor([3], dtype=torch.float)))

For all intents and purposes we can assume the data below is all unseen, although by random chance there could be some overlap with the training X.

unseenX = torch.distributions.uniform.Uniform(-5,15).sample([1000,1])

predictions_on_unseenX = model(unseenX)

Plotting

fig, ax = plt.subplots()
plt.scatter(unseenX, unseenX**2, c="red", label="Actual values", s=1)
plt.scatter(unseenX, predictions_on_unseenX.detach(), c="blue", s=1, label="Predictions")
plt.text(0, 100, "Training data was in this range")
plt.title("Using ReLU")
plt.legend()
ax.axvspan(0, 10, alpha=0.5, color='grey')  # shade the training range [0, 10]
plt.show()

Further Reading

An interesting post on why ReLU works, with the top answer focussing on this specific problem. A similar post to this one on Stack Exchange.