I've thought about this some more and I'm not sure if I can get you an exact "minimum number" or a tight scaling law, but I can get you some bounds if we make many simplifying assumptions.
The theoretical tool we'll be using is the VC dimension. It lets us relate a network's capacity to memorize (a very loose notion of "overfitting") to its number of neurons, up to constant factors.
Similar questions have been asked on CS.SE and on CV but seem aimed towards calculating the VC dimension of a pre-specified neural network.
Assumptions/problem setup
- Suppose that there exists a true "class mapping function" $c: \{\pm 1\}^M \to \{\pm 1\}$ (this is also known as a "concept;" see Mohri et al.).
- For simplicity, assume the output dimensionality $N$ is 1, though the following arguments generalize to larger $N$.
- We aim to learn a mapping $h: \{\pm 1 \}^M \to \{\pm 1\}$ such that for all $x \in \{\pm 1\}^M, h(x) = c(x)$.
Note that this excludes the case where we observe "different outputs for the same inputs" from the comments; i.e., we assume that there is some magical function that maps input to output perfectly. So, the practicality of this assumption is debatable, but we'll make it for simplicity.
What is the VC dimension? The VC dimension of a hypothesis class $\mathcal{H}$ is the largest number of data points $n$ such that some set of $n$ points can be shattered by $\mathcal{H}$: for every one of the $2^n$ possible (binary) labelings of those points, $\mathcal{H}$ contains a classifier producing it. Such a hypothesis class can potentially memorize any labeling of those $n$ points. Since our input space contains exactly $2^M$ points, a hypothesis class $\mathcal{H}$ with $VC(\mathcal{H}) = 2^M$ (e.g., some specific neural network topology) can realize every labeling of the entire input space, which is what you're looking for.
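To make "shattering" concrete, here is a minimal brute-force check, a sketch in plain Python (the helper `shatters` and the toy threshold class are my own, not from any library):

```python
import itertools

def shatters(hypotheses, points):
    """True iff every possible ±1 labeling of `points` is produced by at
    least one hypothesis (a callable x -> ±1) in `hypotheses`."""
    achieved = {tuple(h(x) for x in points) for h in hypotheses}
    wanted = set(itertools.product([-1, 1], repeat=len(points)))
    return wanted <= achieved

# Toy example: 1-D threshold classifiers h_t(x) = +1 iff x >= t.
# They shatter any single point but never two points x1 < x2, since the
# labeling (+1, -1) is unachievable -- so their VC dimension is 1.
thresholds = [lambda x, t=t / 10: 1 if x >= t else -1 for t in range(-20, 21)]
print(shatters(thresholds, [0.5]))        # True
print(shatters(thresholds, [0.5, 1.5]))   # False
```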
How to construct a neural network topology with $VC(\mathcal{H}) = 2^M$
This is Claim 20.1 from Shalev-Shwartz & Ben-David's classic textbook on machine learning theory (Understanding Machine Learning – from Theory to Algorithms) -- available online. They show that it is fairly easy to construct a two-layer (or one hidden layer) neural network with the desired VC dimension. I'll walk through the derivation, adding explanation to the original text here.
Consider a neural network with the following layers: (input, hidden layer, output), which we'll number $\mathcal{V}_0, \mathcal{V}_1, \mathcal{V}_2$. Specify layer sizes such that $|\mathcal{V}_0| = M+1$ (input dimensionality + a bias term), $|\mathcal{V}_1| = 2^M + 1$, and $|\mathcal{V}_2| = 1$. Let the sign function (-1 if negative, 1 if positive, 0 if 0) be our activation function. That is, each neuron implements a function of the form $\text{sign}(\mathbf{w}^\top \mathbf{x} + b)$. We'll define our hypothesis class as "all neural networks with the preceding topology."
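For concreteness, a hypothesis in this class is just the following forward pass (a sketch; the function name and the convention of folding the biases into explicit vectors rather than constant bias neurons are my own):

```python
import numpy as np

def two_layer_sign_net(x, W1, b1, w2, b2):
    """Forward pass of the (V_0 -> V_1 -> V_2) network described above.

    x  : (M,)   input in {±1}^M
    W1 : (H, M) hidden-layer weights,  b1 : (H,) hidden-layer biases
    w2 : (H,)   output-neuron weights, b2 : scalar output-neuron bias
    np.sign implements the activation above (-1 / 0 / +1).
    """
    hidden = np.sign(W1 @ x + b1)      # V_1 activations
    return np.sign(w2 @ hidden + b2)   # the single V_2 output
```

The construction below just picks particular values of `W1`, `b1`, `w2`, `b2`.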
Pick any target function $c: \{\pm 1\}^M \to \{\pm 1\}$. We will show that there exists some setting of neural network weights such that the model output matches $c$. Let $\mathbf{y}_1, \dots, \mathbf{y}_k$ be the set of $k$ vectors in $\{\pm 1\}^M$ such that $c(\mathbf{y}_i) = 1$ for all $i = 1, \dots, k$ (all possible "positive" examples).
Pick some $\mathbf{x} \in \{\pm 1\}^M$. Note that, if $\mathbf{x} = \mathbf{y}_i$, then $\mathbf{x}^\top \mathbf{y}_i = M$, and otherwise, $\mathbf{x}^\top \mathbf{y}_i \leq M - 2$ (each coordinate where $\mathbf{x}$ and $\mathbf{y}_i$ disagree reduces the inner product by 2). Thus, $M - 1$ can serve as a useful cutoff point for encoding potential positives/negatives in our hidden layer $\mathcal{V}_1$. We propose the following construction (checked numerically in the sketch after the list):
- Pick $k$ neurons in $\mathcal{V}_1$, and set their weight vector $\mathbf{w}_i := \mathbf{y}_i$ for $i = 1, \dots, k$, and their biases $b_i := -(M-1)$, yielding neurons of the form $g_i(\mathbf{x}) := \text{sign}(\mathbf{y}_i^\top \mathbf{x} - M + 1)$. We note the following observations:
- If the input $\mathbf{x}$ belongs to the positive class, i.e., $\mathbf{x} = \mathbf{y}_j$ for exactly one $j$ (the $\mathbf{y}_i$ are distinct), then $g_j(\mathbf{x}) = 1$ and the other $k - 1$ outputs are $-1$, so their sum is $2 - k$.
- If the input $\mathbf{x}$ belongs to the negative class, then all $k$ relevant outputs of layer $\mathcal{V}_1$ are $-1$, and their sum is $-k$.
- Thus, set the final layer (which has one neuron) to $\text{sign}(\sum_{i=1}^k g_i(\mathbf{x}) + k - 1)$: it outputs $\text{sign}(2 - k + k - 1) = 1$ on positives and $\text{sign}(-k + k - 1) = -1$ on negatives. Note that this amounts to setting $k$ entries of the output neuron's weight vector to 1 and its bias to $k - 1$.
- Set the remaining weights to zero, so the unused hidden neurons contribute nothing to the output neuron's sign activation.
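Here is the numerical sketch promised above. It builds the weights exactly as in the bullets for a (random) concept $c$ with small $M$ and checks the network against $c$ on all $2^M$ inputs; all variable names are mine, and the zero-weight neurons and bias units are dropped or folded in for brevity:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
M = 4

# A "concept" c: {±1}^M -> {±1}, stored as a lookup table.
inputs = [np.array(x) for x in itertools.product([-1, 1], repeat=M)]
c = {tuple(x): int(rng.choice([-1, 1])) for x in inputs}
c[tuple(inputs[0])] = 1          # ensure at least one positive example (k >= 1)

# y_1, ..., y_k: the positive examples of c.
Y = [x for x in inputs if c[tuple(x)] == 1]
k = len(Y)

# Hidden layer: one neuron per positive example, g_i(x) = sign(y_i . x - M + 1).
W1 = np.array(Y)                 # (k, M) weights
b1 = -(M - 1) * np.ones(k)       # biases

# Output neuron: sign(sum_i g_i(x) + k - 1).
w2 = np.ones(k)
b2 = k - 1

def predict(x):
    hidden = np.sign(W1 @ x + b1)
    return int(np.sign(w2 @ hidden + b2))

# The constructed network reproduces c on every one of the 2^M inputs.
assert all(predict(x) == c[tuple(x)] for x in inputs)
print(f"matches c on all {len(inputs)} inputs")
```

Any other choice of $c$ works the same way, which is exactly the shattering claim above.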
Since our choice of concept $c$ was arbitrary, this neural network topology can represent any possible function $\{\pm 1\}^M \to \{\pm 1\}$. By extension, this hypothesis class has $VC(\mathcal{H}) = 2^M$ as desired. As a final act of housekeeping, note that we used $O(2^M)$ neurons (specifically, $(M + 1) + (2^M + 1) + 1 = 2^M + M + 3$).
Caveats: Note that this result only concerns very shallow neural network topologies with a sign activation function. I'm not sure if generalizations to deeper architectures exist.
Is the number of neurons required actually $\Omega(2^M)$?
It is harder to prove a rigorous lower bound (i.e., to show that this configuration is really minimal), but I'll sketch an informal argument. Note that to solve any classification problem defined by some $c: \{\pm 1\}^M \to \{\pm 1\}$, our model has to be able to represent all $2^{2^M}$ such functions (already $2^{2^3} = 256$ functions for $M = 3$).
Each hidden neuron has $M + 1 = O(M)$ parameters. If we make the loose assumption that each parameter is stored to some fixed precision, then a network with $n$ such neurons is described by $O(Mn)$ bits and can therefore realize at most $2^{O(Mn)}$ distinct functions.
To cover all $2^{2^M}$ target functions we need $2^{O(Mn)} \geq 2^{2^M}$, i.e., $n = \Omega(2^M / M)$, which suggests that an exponential (in the input dimensionality) number of neurons is indeed required.
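In symbols, with the (hand-wavy, assumed) constant $b$ denoting the number of bits per parameter:

$$
\underbrace{\left(2^{\,b(M+1)}\right)^{n}}_{\text{distinct settings of } n \text{ hidden neurons}} \;\geq\; \underbrace{2^{\,2^M}}_{\text{distinct target functions}}
\quad\Longrightarrow\quad
n \;\geq\; \frac{2^M}{b\,(M+1)} \;=\; \Omega\!\left(\frac{2^M}{M}\right).
$$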
There are rigorously proven bounds (Theorems 20.2 and 20.6 of Shalev-Shwartz & Ben-David together give $\Omega(2^{M/3})$, though I seem to have missed some part of their argument), but I'm not familiar enough with the proofs to vouch for them. The punchline in every case is the exponential dependence.
Thus, we have a construction showing that an exponential number of neurons suffices to classify any arbitrary mapping of $M$-dimensional Boolean vectors to 1-D binary labels, and an informal counting argument suggesting that exponentially many are also necessary.
As for the second part of the question, about whether GD would have trouble with (or be well-suited for) optimizing any particular topology: my intuition is that, since the resulting objectives are generally highly non-convex, finding the exact (globally optimal) solution is "usually" NP-hard, especially since we're trying to make claims across all possible classification problems of a certain size.
Well, this is as close to a scaling law as I was able to get -- hopefully this gets you partway to what you're looking for, and let me know if you have any questions.