I've thought about this some more and I'm not sure if I can get you an exact "minimum number" or a tight scaling law, but I can get you some bounds if we make many simplifying assumptions.
The theoretical tool we'll be using is the VC dimension. It lets us relate a network's capacity to memorize (a very loose notion of "overfitting") to its number of neurons, up to constant factors.
Similar questions have been asked on CS.SE and on CV but seem aimed towards calculating the VC dimension of a pre-specified neural network.
Assumptions/problem setup
- Suppose that there exists a true "class mapping function" $c: \{\pm 1\}^M \to \{\pm 1\}$ (this is also known as a "concept;" see Mohri et al.).
- For simplicity, assume the output dimensionality $N$ is 1, though the following arguments generalize to larger $N$.
- We aim to learn a mapping $h: \{\pm 1 \}^M \to \{\pm 1\}$ such that for all $x \in \{\pm 1\}^M, h(x) = c(x)$.
Note that this excludes the case where we observe "different outputs for the same inputs" from the comments; i.e., we assume that there is some magical function that maps input to output perfectly. So, the practicality of this assumption is debatable, but we'll make it for simplicity.
What is the VC dimension? The VC dimension of a hypothesis class $\mathcal{H}$ is the largest number of data points $n$ such that some set of $n$ points can be shattered by $\mathcal{H}$: for every one of the $2^n$ possible (binary) labelings of those points, $\mathcal{H}$ contains a classifier producing it. Such a hypothesis class can potentially memorize any labeling of those $n$ points. Since our input space contains exactly $2^M$ points, a hypothesis class $\mathcal{H}$ with $VC(\mathcal{H}) = 2^M$ (e.g., some specific neural network topology) can realize every labeling of the entire input space, which is what you're looking for.
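To make "shattering" concrete, here is a minimal brute-force check, a sketch in plain Python (the helper `shatters` and the toy threshold class are my own, not from any library):

```python
import itertools

def shatters(hypotheses, points):
    """True iff every possible ±1 labeling of `points` is produced by at
    least one hypothesis (a callable x -> ±1) in `hypotheses`."""
    achieved = {tuple(h(x) for x in points) for h in hypotheses}
    wanted = set(itertools.product([-1, 1], repeat=len(points)))
    return wanted <= achieved

# Toy example: 1-D threshold classifiers h_t(x) = +1 iff x >= t.
# They shatter any single point but never two points x1 < x2, since the
# labeling (+1, -1) is unachievable -- so their VC dimension is 1.
thresholds = [lambda x, t=t / 10: 1 if x >= t else -1 for t in range(-20, 21)]
print(shatters(thresholds, [0.5]))        # True
print(shatters(thresholds, [0.5, 1.5]))   # False
```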
How to construct a neural network topology with $VC(\mathcal{H}) = 2^M$
This is Claim 20.1 from Shalev-Shwartz & Ben-David's classic textbook on machine learning theory (Understanding Machine Learning – from Theory to Algorithms) -- available online. They show that it is fairly easy to construct a two-layer (or one hidden layer) neural network with the desired VC dimension. I'll walk through the derivation, adding explanation to the original text here.
Consider a neural network with the following layers: (input, hidden layer, output), which we'll number $\mathcal{V}_0, \mathcal{V}_1, \mathcal{V}_2$. Specify layer sizes such that $|\mathcal{V}_0| = M+1$ (input dimensionality + a bias term), $|\mathcal{V}_1| = 2^M + 1$, and $|\mathcal{V}_2| = 1$. Let the sign function (-1 if negative, 1 if positive, 0 if 0) be our activation function. That is, each neuron implements a function of the form $\text{sign}(\mathbf{w}^\top \mathbf{x} + b)$. We'll define our hypothesis class as "all neural networks with the preceding topology."
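For concreteness, a hypothesis in this class is just the following forward pass (a sketch; the function name and the convention of folding the biases into explicit vectors rather than constant bias neurons are my own):

```python
import numpy as np

def two_layer_sign_net(x, W1, b1, w2, b2):
    """Forward pass of the (V_0 -> V_1 -> V_2) network described above.

    x  : (M,)   input in {±1}^M
    W1 : (H, M) hidden-layer weights,  b1 : (H,) hidden-layer biases
    w2 : (H,)   output-neuron weights, b2 : scalar output-neuron bias
    np.sign implements the activation above (-1 / 0 / +1).
    """
    hidden = np.sign(W1 @ x + b1)      # V_1 activations
    return np.sign(w2 @ hidden + b2)   # the single V_2 output
```

The construction below just picks particular values of `W1`, `b1`, `w2`, `b2`.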
Pick any target function $c: \{\pm 1\}^M \to \{\pm 1\}$. We will show that there exists some setting of neural network weights such that the model output matches $c$. Let $\mathbf{y}_1, \dots, \mathbf{y}_k$ be the set of $k$ vectors in $\{\pm 1\}^M$ such that $c(\mathbf{y}_i) = 1$ for all $i = 1, \dots, k$ (all possible "positive" examples).
Pick some $\mathbf{x} \in \{\pm 1\}^M$. Note that, if $\mathbf{x} = \mathbf{y}_i$, then $\mathbf{x}^\top \mathbf{y}_i = M$, and otherwise, $\mathbf{x}^\top \mathbf{y}_i \leq M - 2$ (each coordinate where $\mathbf{x}$ and $\mathbf{y}_i$ disagree reduces the inner product by 2). Thus, $M - 1$ can serve as a useful cutoff point for encoding potential positives/negatives in our hidden layer $\mathcal{V}_1$. We propose the following construction (checked numerically in the sketch after the list):
- Pick $k$ neurons in $\mathcal{V}_1$, and set their weight vector $\mathbf{w}_i := \mathbf{y}_i$ for $i = 1, \dots, k$, and their biases $b_i := -(M-1)$, yielding neurons of the form $g_i(\mathbf{x}) := \text{sign}(\mathbf{y}_i^\top \mathbf{x} - M + 1)$. We note the following observations:
- If the input $\mathbf{x}$ belongs to the positive class, i.e., $\mathbf{x} = \mathbf{y}_j$ for exactly one $j$ (the $\mathbf{y}_i$ are distinct), then $g_j(\mathbf{x}) = 1$ and the other $k - 1$ outputs are $-1$, so their sum is $2 - k$.
- If the input $\mathbf{x}$ belongs to the negative class, then all $k$ relevant outputs of layer $\mathcal{V}_1$ are $-1$, and their sum is $-k$.
- Thus, set the final layer (which has one neuron) to $\text{sign}(\sum_{i=1}^k g_i(\mathbf{x}) + k - 1)$: it outputs $\text{sign}(2 - k + k - 1) = 1$ on positives and $\text{sign}(-k + k - 1) = -1$ on negatives. Note that this amounts to setting $k$ entries of the output neuron's weight vector to 1 and its bias to $k - 1$.
- Set the remaining weights to zero, so the unused hidden neurons contribute nothing to the output neuron's sign activation.
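Here is the numerical sketch promised above. It builds the weights exactly as in the bullets for a (random) concept $c$ with small $M$ and checks the network against $c$ on all $2^M$ inputs; all variable names are mine, and the zero-weight neurons and bias units are dropped or folded in for brevity:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
M = 4

# A "concept" c: {±1}^M -> {±1}, stored as a lookup table.
inputs = [np.array(x) for x in itertools.product([-1, 1], repeat=M)]
c = {tuple(x): int(rng.choice([-1, 1])) for x in inputs}
c[tuple(inputs[0])] = 1          # ensure at least one positive example (k >= 1)

# y_1, ..., y_k: the positive examples of c.
Y = [x for x in inputs if c[tuple(x)] == 1]
k = len(Y)

# Hidden layer: one neuron per positive example, g_i(x) = sign(y_i . x - M + 1).
W1 = np.array(Y)                 # (k, M) weights
b1 = -(M - 1) * np.ones(k)       # biases

# Output neuron: sign(sum_i g_i(x) + k - 1).
w2 = np.ones(k)
b2 = k - 1

def predict(x):
    hidden = np.sign(W1 @ x + b1)
    return int(np.sign(w2 @ hidden + b2))

# The constructed network reproduces c on every one of the 2^M inputs.
assert all(predict(x) == c[tuple(x)] for x in inputs)
print(f"matches c on all {len(inputs)} inputs")
```

Any other choice of $c$ works the same way, which is exactly the shattering claim above.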
Since our choice of concept $c$ was arbitrary, this neural network topology can represent any possible function $\{\pm 1\}^M \to \{\pm 1\}$. By extension, this hypothesis class has $VC(\mathcal{H}) = 2^M$ as desired. As a final act of housekeeping, note that we used $O(2^M)$ neurons (specifically, $(M + 1) + (2^M + 1) + 1 = 2^M + M + 3$).
Caveats: Note that this result only concerns very shallow neural network topologies with a sign activation function. I'm not sure if generalizations to deeper architectures exist.
Is the number of neurons required actually $\Omega(2^M)$?
It is harder to prove a rigorous lower bound (i.e., to show that this configuration is really minimal), but I'll sketch an informal argument. Note that to solve any classification problem defined by some $c: \{\pm 1\}^M \to \{\pm 1\}$, our model has to be able to represent all $2^{2^M}$ such functions (already $2^{2^3} = 256$ functions for $M = 3$).
Each hidden neuron has $M + 1 = O(M)$ parameters. If we make the loose assumption that each parameter is stored to some fixed precision, then a network with $n$ such neurons is described by $O(Mn)$ bits and can therefore realize at most $2^{O(Mn)}$ distinct functions.
To cover all $2^{2^M}$ target functions we need $2^{O(Mn)} \geq 2^{2^M}$, i.e., $n = \Omega(2^M / M)$, which suggests that an exponential (in the input dimensionality) number of neurons is indeed required.
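In symbols, with the (hand-wavy, assumed) constant $b$ denoting the number of bits per parameter:

$$
\underbrace{\left(2^{\,b(M+1)}\right)^{n}}_{\text{distinct settings of } n \text{ hidden neurons}} \;\geq\; \underbrace{2^{\,2^M}}_{\text{distinct target functions}}
\quad\Longrightarrow\quad
n \;\geq\; \frac{2^M}{b\,(M+1)} \;=\; \Omega\!\left(\frac{2^M}{M}\right).
$$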
There are rigorously proven bounds (Theorems 20.2 and 20.6 of Shalev-Shwartz & Ben-David together give $\Omega(2^{M/3})$, though I seem to have missed some part of their argument), but I'm not familiar enough with the proofs to vouch for them. The punchline in every case is the exponential dependence.
Thus, we have a construction showing that an exponential number of neurons suffices to classify any arbitrary mapping of $M$-dimensional Boolean vectors to 1-D binary labels, and an informal counting argument suggesting that exponentially many are also necessary.
As for the second part of the question, about whether GD would have trouble with (or be well-suited for) optimizing any particular topology: my intuition is that, since the resulting objectives are generally highly non-convex, finding the exact (globally optimal) solution is "usually" NP-hard, especially since we're trying to make claims across all possible classification problems of a certain size.
Well, this is as close to a scaling law as I was able to get -- hopefully this gets you partway to what you're looking for, and let me know if you have any questions.