Tree-based models such as gradient boosting or random forests have many advantages, such as robustness to collinearity and outliers.
I can see that deep neural networks (MLPs) are robust to collinearity. But are they robust to outliers, and why?
Relative to a standard multiple regression model, I believe an MLP is much more robust to outliers, for several reasons:
1) Multiple regression gets only a single shot at fitting the data, while an MLP has many more opportunities to fit it by varying the number of nodes and hidden layers. This more flexible fitting mechanism should allow the MLP to underweight the impact of outliers (in either the Y or the X variables);
2) MLPs' activation functions are typically the logistic sigmoid or the hyperbolic tangent (tanh). The former squashes intermediate outputs into (0, 1) and the latter into (-1, +1), bounding the influence of extreme values. These activation functions further enhance an MLP's ability to deal with non-linearities and outliers.
3) MLPs can incorporate regularization mechanisms, which should help resolve multicollinearity and reduce the impact of outliers.
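To make point 3 concrete, here is a minimal sketch of fitting an MLP with L2 regularization (the `alpha` weight-decay parameter of scikit-learn's `MLPRegressor`). The dataset is synthetic and purely illustrative:

```python
# Sketch: an MLP with L2 regularization on data containing a few target
# outliers. All data here is synthetic, chosen only for illustration.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
y[:5] += 20.0  # inject a few outliers into the target

mlp = MLPRegressor(hidden_layer_sizes=(16, 16),
                   alpha=1e-2,          # L2 penalty (weight decay)
                   max_iter=2000, random_state=0)
mlp.fit(X, y)
print("training R^2:", mlp.score(X, y))
```

Increasing `alpha` shrinks the weights, which limits how much the network can contort itself to chase the outlying points.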
Also, to diagnose the impact of outliers on your MLP, you can use cross-validation.
However, if your main objective is to reduce the impact of outliers, there are more transparent ways to do so. Tree-based models are certainly a good option, as you mentioned. But there is also a whole family of robust regression models, some of which combine with regularization mechanisms to resolve multicollinearity. And those models are far easier to explain to a non-specialized audience.
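As a sketch of the robust regression family mentioned above, here is Huber regression compared against ordinary least squares on data with a few outliers (synthetic data, illustrative only):

```python
# Sketch: OLS vs. Huber regression (a robust regression model) on data
# with heavy outliers added to y. True model: y = 3x + noise, intercept 0.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.2, size=200)
y[:10] += 50.0  # ten heavy outliers shift the target upward

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print("OLS intercept:  ", ols.intercept_)    # pulled upward by the outliers
print("Huber intercept:", huber.intercept_)  # stays near the true value of 0
```

`HuberRegressor` also accepts an `alpha` regularization parameter, which is one way robust losses and regularization combine to address outliers and multicollinearity at the same time.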
Based on the article by Lin and Tegmark (below), I think the answer is "it depends." As they state, most deep learning algorithms assume a roughly lognormal distribution. As long as your data fits that assumption, there is no problem; the problem arises when the tails of the distribution are heavier than lognormal, e.g., power-law or super-exponential. Their Figure 1 and the related discussion outline the issues caused by poor tail fit under lognormality across several data types, and solutions are proposed specifically in the context of deep learning NNs.
Lin and Tegmark, Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language arXiv:1606.06737
Multilayer perceptrons (MLPs) are sensitive to outliers.
MLPs are universal approximators, i.e., they can approximate any target function. With such an expressive hypothesis space, an MLP risks overfitting by learning from noise (outliers).
Outliers can also cause slow learning, or none at all, because of the vanishing-gradient problem: activations saturate at either tail (0 or 1), and gradients are near zero in those regions.
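The saturation effect is easy to see numerically: the sigmoid's gradient is σ'(x) = σ(x)(1 − σ(x)), which collapses toward zero for large |x|, exactly the regime an extreme (outlying) input pushes a unit into:

```python
# Sketch: the sigmoid's gradient shrinks toward zero as |x| grows,
# which is how an extreme pre-activation (driven by an outlier) stalls
# gradient-based learning.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 10.0]:
    s = sigmoid(x)
    print(f"x={x:5.1f}  sigma={s:.6f}  gradient={s * (1 - s):.6f}")
```

At x = 0 the gradient is 0.25, its maximum; by x = 10 it is on the order of 1e-5, so almost no error signal flows back through that unit.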
If a feature has a variance that is orders of magnitude larger than the others, it can dominate the objective function, leaving the model unable to learn from the other features.
Hence, as a preprocessing step, it is highly recommended to standardize the training data to reduce the influence of outliers.
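A minimal sketch of that preprocessing step, using scikit-learn scalers on a toy column with one extreme value. Note that `StandardScaler` relies on the mean and standard deviation, which are themselves sensitive to outliers; `RobustScaler`, based on the median and interquartile range, is an alternative worth considering when outliers are present:

```python
# Sketch: scaling a feature that contains one extreme value.
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one extreme value

# Mean/std based scaling: the outlier inflates the std, compressing
# the "normal" points into a narrow band.
print(StandardScaler().fit_transform(X).ravel())

# Median/IQR based scaling: the normal points keep a sensible spread.
print(RobustScaler().fit_transform(X).ravel())
```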