
I have a regression data set with ~1000 features mapping to a single value. Neural networks are consistently 2x more accurate than linear regression, at least with the features I am using.

I am not overfitting either, as I am comparing validation and test errors, and keeping my NNs rather small.

I am wondering how to explain why neural networks are more accurate than linear when using a particular set of features.

Intuitively, the NNs must be finding nonlinear relationships between the features and the target, and apparently that matters for these particular features.

So here's what I tried:

  1. Calculating $R^2$ values for linear regression. In my case, $R^2=0.98$, suggesting that linear regression is okay.
  2. Calculating the PCA eigenvalues of the features; they decay rather quickly ([plot: eigenvalue spectrum of the feature PCA]). This suggests that PCA can successfully summarize the features with a few linear combinations? (A rough sketch of both checks is below.)
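
The sketch (the random `X` and `y` are just small stand-ins for my standardized ~1000-feature matrix and target):

```python
# Minimal sketch of the two checks; X and y are synthetic stand-ins for my data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 200))                   # stand-in for the standardized features
y = X @ rng.normal(size=200) + rng.normal(size=5_000)

r2 = LinearRegression().fit(X, y).score(X, y)       # 1. R^2 of the plain linear fit
eigvals = PCA().fit(X).explained_variance_          # 2. eigenvalue spectrum of the features
print(f"R^2 = {r2:.3f}", "leading eigenvalues:", eigvals[:5])
```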

I am confused by these two results ($R^2=0.98$ and quickly decaying eigenvalues), because it suggests that linear regression should perform well.

Are there other analyses that explain why some features are more suitable for nonlinear regression compared to linear regression?

Maybe a correlation matrix of the features or something?

  • How many observations do you have? How do you measure the accuracy? Did you standardise the features before running the PCA? – Christian Hennig Jan 07 '23 at 16:52
  • @ChristianHennig About 1 million observations. Accuracy = MAE on validation and test sets. Features are standardized before PCA. – interatomic Jan 07 '23 at 17:00
  • PCA has to do with the variance of the features, not the variance in the outcome that can be explained by the features. – Dave Jan 07 '23 at 17:20
  • You could try plotting the target variable against the ten features constructed by your PCA... this should show any significant nonlinearities, although it won't show interaction effects except incidentally. – jbowman Jan 07 '23 at 17:21
  • @Dave yeah good point. In this case we should consider variance of the outcome that can be explained by the features, which I thought was $R^2$. – interatomic Jan 07 '23 at 17:22
  • $R^2$ only has that interpretation in special settings that exclude nonlinear models like neural networks, though $R^2$ is a reasonable transformation of mean squared error that relates to a comparison with the performance of a simple baseline model, depending on how you do the calculation. – Dave Jan 07 '23 at 17:27
  • Outliers may have a big impact on least squares linear regression and can also produce a deceptively high value of $R^2$. (With a data set as big as this you probably need a good number of outliers, maybe a cluster of them, but anyway.) – Christian Hennig Jan 07 '23 at 18:28
  • Returning to this, when you write that the neural network is consistently 2x more accurate, what do you mean? You say that the linear model gets $R^2=0.98$, and you cannot mean that your neural network gets $R^2 = 1.96$. (If you do, some debugging will be useful.) – Dave Mar 16 '23 at 23:39
  • That PCA can reduce the dimensionality of the data with little loss of information is no indication whatsoever that any of the observed features are at all related to the value you're trying to predict. If I give you a dataset that contains individuals' heights measured in inches, centimeters, feet, furlongs, and fathoms, PCA will reduce that to a single dimension with no loss of variability, but that of course doesn't mean the features are useful for an arbitrary classification task like predicting the person's social security number. – Nuclear Hoagie Apr 18 '23 at 17:44

1 Answer


I am wondering how to explain why neural networks are more accurate than linear when using a particular set of features.

Neural networks allow for an enormous number of interactions and nonlinear transformations of the original features. In some regard, neural networks do the feature engineering for you, so you do not have to figure out that a particular interaction matters or that some feature should be squared.

If you get a good fit with a linear model but reliably get an even better fit with a neural network on those same features, it would seem that those nonlinear features and interactions discovered by the neural network matter to the outcome. The math does not work out as cleanly as it does for nested GLMs, but you can think of this as analogous to fitting a model, fitting a more complex model, testing the added features, and getting a low p-value (e.g., partial F-test, "chunk" test in more generality).
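
Here is a minimal sketch of that nested-model analogy on made-up data (my toy example with statsmodels, not the question's data): fit the linear model, fit a model with an added chunk of squared and interaction terms, and test the chunk.

```python
# Toy illustration of the nested-model "chunk" test (made-up data, not the OP's).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2 * df.x1 - df.x2 + 0.5 * df.x1 * df.x2 + 0.3 * df.x1**2 + rng.normal(size=n)

small = smf.ols("y ~ x1 + x2", data=df).fit()                   # linear terms only
big = smf.ols("y ~ x1 + x2 + I(x1**2) + x1:x2", data=df).fit()  # adds a chunk of nonlinear terms

print(anova_lm(small, big))  # low p-value here: the added nonlinear chunk matters
```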

To some extent, you are seeing the universal approximation theorems in action. Loosely speaking, the various universal approximation theorems say that a decent$^{\dagger}$ function can be approximated arbitrarily well by a neural network of sufficient size. A linear combination of the raw features has no such guarantee, hence the stronger performance of the neural network.
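
As a deliberately extreme toy illustration (my own sketch, not the question's data): if the target is a pure sine of a single feature, a linear fit in the raw feature captures very little, while even a small relu network approximates it well.

```python
# Toy case: little linear signal in the raw feature, but a small relu net fits it.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2_000, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=2_000)

lin = LinearRegression().fit(X, y)
net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2_000, random_state=0).fit(X, y)

print("linear R^2:", lin.score(X, y))  # small: the linear fit captures little of the sine
print("MLP R^2:   ", net.score(X, y))  # much higher: the network approximates the curve
```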

Where you can get into trouble is that linear models can also involve feature interactions and nonlinear features. The universal approximation theorems say that neural networks can approximate decent functions as well as is desired; the Stone-Weierstrass theorem says about the same about polynomial regressions (which are linear models). However, with a polynomial regression you have to tell the computer what those polynomial features are; you cannot just change `model.add(tf.keras.layers.Dense(32, activation='relu'))` to `model.add(tf.keras.layers.Dense(320, activation='relu'))` to get more nonlinearity, the way you can with a neural network. This makes it quite easy to increase model flexibility in a neural network compared to a linear model, for better or for worse.
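
For contrast, here is a hedged sketch of what manually giving a linear model that kind of flexibility looks like (placeholder names, scikit-learn assumed): you construct the polynomial and interaction features explicitly, and with ~1000 raw features a degree-2 expansion already produces on the order of 500,000 columns.

```python
# Sketch: explicit feature engineering for a *linear* model (names are placeholders).
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

poly_linear = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),  # all squares + pairwise interactions
    Ridge(alpha=1.0),                                   # still linear in the expanded features
)
# poly_linear.fit(X_train, y_train)   # X_train, y_train: placeholders for your training data
```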

Cheng et al. (2018) have an interesting arXiv paper on polynomial regression vs. neural networks.

REFERENCE

Cheng, Xi, et al. "Polynomial regression as an alternative to neural nets." arXiv preprint arXiv:1806.06850 (2018).

$^{\dagger}$This is deliberately vague.

Dave