Tree-based models such as gradient boosting or random forests have many advantages, such as robustness to collinearity and outliers.
I can see that deep neural networks (MLPs) are robust to collinearity. But are they robust to outliers, and why?
Relative to a standard multiple regression model, I believe an MLP is much more robust to outliers, for several reasons:
1) Multiple regression gets only a single shot at fitting the data, while an MLP has many more opportunities to fit it by varying the number of nodes and hidden layers. This more flexible fitting mechanism should allow the MLP to underweight the impact of outliers (in either the Y or the X variables);
2) MLPs' activation functions are typically the logistic sigmoid or the hyperbolic tangent (tanh). The former squashes intermediate outputs into (0, 1) and the latter into (-1, +1), bounding the influence of extreme values. These activation functions further enhance an MLP's ability to deal with non-linearities and outliers.
3) MLPs can incorporate regularization mechanisms, which should help resolve multicollinearity and reduce the impact of outliers.
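To make point 3 concrete, here is a minimal sketch of fitting an MLP with L2 regularization (the `alpha` weight-decay parameter of scikit-learn's `MLPRegressor`). The dataset is synthetic and purely illustrative:

```python
# Sketch: an MLP with L2 regularization on data containing a few target
# outliers. All data here is synthetic, chosen only for illustration.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
y[:5] += 20.0  # inject a few outliers into the target

mlp = MLPRegressor(hidden_layer_sizes=(16, 16),
                   alpha=1e-2,          # L2 penalty (weight decay)
                   max_iter=2000, random_state=0)
mlp.fit(X, y)
print("training R^2:", mlp.score(X, y))
```

Increasing `alpha` shrinks the weights, which limits how much the network can contort itself to chase the outlying points.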
Also, to diagnose the impact of outliers on your MLP, you can use cross-validation.
However, if your main objective is to reduce the impact of outliers, there are more transparent ways to do so. Tree-based models are certainly a good option, as you mentioned. But there is also a whole family of robust regression models, some of which combine with regularization mechanisms to resolve multicollinearity. And those models are far easier to explain to a non-specialized audience.
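As a sketch of the robust regression family mentioned above, here is Huber regression compared against ordinary least squares on data with a few outliers (synthetic data, illustrative only):

```python
# Sketch: OLS vs. Huber regression (a robust regression model) on data
# with heavy outliers added to y. True model: y = 3x + noise, intercept 0.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.2, size=200)
y[:10] += 50.0  # ten heavy outliers shift the target upward

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print("OLS intercept:  ", ols.intercept_)    # pulled upward by the outliers
print("Huber intercept:", huber.intercept_)  # stays near the true value of 0
```

`HuberRegressor` also accepts an `alpha` regularization parameter, which is one way robust losses and regularization combine to address outliers and multicollinearity at the same time.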
Based on the article by Lin and Tegmark (below), I think the answer is "it depends." As they state, most deep learning algorithms assume a roughly lognormal distribution. As long as your data fits that assumption, there is no problem; the problem arises when the tails of the distribution are heavier than lognormal, e.g., power-law or super-exponential. Their Figure 1 and the related discussion outline the issues caused by poor tail fit under lognormality across several data types, and solutions are proposed specifically in the context of deep learning NNs.
Lin and Tegmark, Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language arXiv:1606.06737
Multilayer perceptrons (MLPs) are sensitive to outliers.
MLPs are universal approximators, i.e., they can approximate any target function. With such an expressive hypothesis space, an MLP risks overfitting by learning from noise (outliers).
Outliers can also cause slow learning, or none at all, because of the vanishing-gradient problem: activations saturate at either tail (0 or 1), and gradients are near zero in those regions.
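The saturation effect is easy to see numerically: the sigmoid's gradient is σ'(x) = σ(x)(1 − σ(x)), which collapses toward zero for large |x|, exactly the regime an extreme (outlying) input pushes a unit into:

```python
# Sketch: the sigmoid's gradient shrinks toward zero as |x| grows,
# which is how an extreme pre-activation (driven by an outlier) stalls
# gradient-based learning.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 10.0]:
    s = sigmoid(x)
    print(f"x={x:5.1f}  sigma={s:.6f}  gradient={s * (1 - s):.6f}")
```

At x = 0 the gradient is 0.25, its maximum; by x = 10 it is on the order of 1e-5, so almost no error signal flows back through that unit.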
If a feature has a variance that is orders of magnitude larger than the others, it can dominate the objective function, leaving the model unable to learn from the other features.
Hence, as a preprocessing step, it is highly recommended to standardize the training data to reduce the influence of outliers.
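A minimal sketch of that preprocessing step, using scikit-learn scalers on a toy column with one extreme value. Note that `StandardScaler` relies on the mean and standard deviation, which are themselves sensitive to outliers; `RobustScaler`, based on the median and interquartile range, is an alternative worth considering when outliers are present:

```python
# Sketch: scaling a feature that contains one extreme value.
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one extreme value

# Mean/std based scaling: the outlier inflates the std, compressing
# the "normal" points into a narrow band.
print(StandardScaler().fit_transform(X).ravel())

# Median/IQR based scaling: the normal points keep a sensible spread.
print(RobustScaler().fit_transform(X).ravel())
```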