
This is a basic question but I did not find the answer in most common statistical learning books.

In linear regression we assume that the residuals are i.i.d. Do we assume the same for a regression made by a ML algorithm?

I use random forest in particular. My residuals show spatial autocorrelation. This is not a problem in itself, because I can run some diagnostics and account for it. But more generally, I want to know whether this is harmful for the random forest model, since it violates the i.i.d. assumption on the residuals.

This relates to this question, but only for the residuals.

2 Answers


First of all, in linear regression we don't assume that the residuals are IID. The assumption is that the *errors* are IID (or at least spherical). In general, when the model is fit by OLS and the errors are normally distributed, the residuals are distributed as:

$$r = (I-H)\epsilon\sim N(0,\sigma^2(I-H))$$

where $H$ is the hat matrix (https://en.wikipedia.org/wiki/Projection_matrix). As can be seen, the covariance matrix is not diagonal (the residuals are dependent), and the variances are not all equal (the residuals are not identically distributed).
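This is easy to verify numerically. A minimal sketch with synthetic data (the design matrix and $\sigma^2$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Design matrix with an intercept and one predictor.
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)

# Residual covariance sigma^2 (I - H); its diagonal gives Var(r_i).
sigma2 = 1.0
cov_r = sigma2 * (np.eye(n) - H)

# The diagonal entries differ (residuals are not identically distributed)
# and the off-diagonal entries are nonzero (residuals are not independent).
print(cov_r.diagonal().min(), cov_r.diagonal().max())
print(np.abs(cov_r[0, 1]))
```

The spread in the diagonal reflects the leverages $h_{ii}$: high-leverage points have smaller residual variance.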

In general, residuals are very unlikely to be independent, since each residual is a function of the same training data, which by itself creates dependency. Nor will they be identically distributed, because of their positions in the input space.

Having said that, there is no requirement that the errors be IID for any kind of fitting algorithm. It has advantages, though. Firstly, you can derive conclusions about your estimates more easily if you know the distribution of the errors. Secondly, dependent errors are an indication that you haven't consumed all the available information in the data set, which means you could have done better with a different model. Nevertheless, these are all presumptions and must be checked by examining the residuals, since the errors themselves are not accessible.
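For the spatial autocorrelation the question mentions, a common residual diagnostic is Moran's I. A minimal sketch with synthetic coordinates and residuals (the inverse-distance weights and the simulated spatial trend are illustrative assumptions, standing in for real random-forest residuals):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
coords = rng.uniform(0, 10, size=(n, 2))

# Residuals with a spatial trend plus noise (stand-in for RF residuals).
resid = coords[:, 0] + rng.normal(scale=0.5, size=n)

# Inverse-distance spatial weights, zero on the diagonal.
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
W = np.where(d > 0, 1.0 / d, 0.0)

# Moran's I: (n / sum(W)) * (z' W z) / (z' z) on centered residuals.
z = resid - resid.mean()
morans_i = (n / W.sum()) * (z @ W @ z) / (z @ z)

# Under no spatial autocorrelation E[I] = -1/(n-1) (about 0); values
# clearly above that indicate positive spatial autocorrelation.
print(round(morans_i, 3))
```

A significance test would compare this statistic against its permutation distribution; packages such as `esda` (PySAL) implement that directly.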


In my opinion:

Since "Residuals in a statistical or machine learning model are the differences between observed and predicted values of data," you can always plot these differences manually when evaluating your model on test_ds (or even train_ds). If you see any pattern (e.g. a parabolic trend), improve your model (e.g. add one more dense layer so the model can capture that non-linearity). Keep improving your model until the residual plot shows the main characteristics of a good one:

  1. a high density of points close to the origin and a low density of points away from the origin
  2. symmetry about the origin

This is how you completely capture the predictive information of the data in your model (of course, you should take the residual plots into consideration as well as the metrics you have chosen).
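The "parabolic trend" case above can be sketched concretely: fitting a straight line to data with a quadratic signal leaves a systematic pattern in the residuals, which disappears once the model is improved (the data here are synthetic, chosen only to illustrate the diagnostic):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.2, size=x.size)

# Misspecified model: fit a straight line and compute residuals.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# Pattern check: the residuals correlate strongly with x^2,
# i.e. they are not symmetric noise about zero.
pattern = np.corrcoef(resid, x**2)[0, 1]
print(round(pattern, 2))

# After adding the quadratic term, the pattern disappears.
coef = np.polyfit(x, y, 2)
resid2 = y - np.polyval(coef, x)
pattern2 = np.corrcoef(resid2, x**2)[0, 1]
print(round(pattern2, 2))
```

In practice one would eyeball the residual-vs-fitted plot; the correlation with a candidate term is just a quick numeric stand-in for that visual check.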

P.S. At the least, strive for a Brier score close to 0. Proper scoring rules used to evaluate your model say the most (though not everything).
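For reference, the Brier score is just the mean squared difference between predicted probabilities and binary outcomes; the toy labels and probabilities below are made up for illustration:

```python
import numpy as np

# Observed binary outcomes and predicted probabilities.
y_true = np.array([0, 1, 1, 0, 1])
p_hat = np.array([0.1, 0.9, 0.8, 0.3, 0.6])

# Brier score: mean((p - y)^2); 0 is perfect, lower is better.
brier = np.mean((p_hat - y_true) ** 2)
print(round(brier, 3))  # → 0.062
```

Note the Brier score applies to probabilistic classification; for a plain regression task the analogous quantity is the mean squared error of the residuals.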

P.P.S. Of course, the residuals will not always be i.i.d., since the variable you are modelling (with ML) is not always normally distributed. But to get an unbiased regression you need well-behaved residuals; otherwise, identify the degree of non-normality in the residuals and try to diagnose the problem, for the sake of the adequacy of the model. (here and here, OR here)

JeeyCi