
In a multiple regression problem, suppose we have responses $Y_1, Y_2, \cdots , Y_n$ corresponding to covariate vectors $\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_n$, where each $\mathbf{X}_i$ is $d$-dimensional. The covariates arise from some continuous distribution.

Now, we can of course employ our good old multiple linear regression here. However, a major problem is that our dataset may have outliers in any of the covariates and/or in the responses. In such cases, OLS-based linear regression often fails miserably. We could use other regression techniques, viz. LAD (least absolute deviations) regression or LMS (least median of squares) regression, which can take care of the outliers to some extent, but they are computationally challenging, and they may have multiple solutions that are not statistically meaningful. So we would like to stick to OLS.

A natural approach is to detect the outliers, throw them out, and perform the regression again. However, I am not aware of any statistical method that can detect multivariate outliers. Please share your knowledge, thoughts, or resources on this problem.

JRC
  • Are you planning to use Python for the outlier detection? If so, sklearn has quite a few techniques available, some shown in a nice plot at the start of section 2.7.1 here: https://scikit-learn.org/stable/modules/outlier_detection.html – Alex Oct 02 '22 at 08:45
  • I am much more comfortable in R. Thanks anyway. I'll look into the methods discussed in your link. – JRC Oct 02 '22 at 08:55
  • Re "haven't heard:" please try searching our site. – whuber Oct 02 '22 at 15:31
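To make the detect, discard, and refit idea from the question concrete, here is a minimal sketch. It is written in Python with NumPy only (the asker prefers R, so treat it as pseudocode for the logic). It uses Cook's distance, a classical OLS influence diagnostic, as the detection step; that choice is an assumption of this sketch, not something proposed in the thread. The function names `cooks_distance` and `fit_without_flagged`, the `4/n` cutoff (a common rule of thumb), and the synthetic data are all illustrative.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance of each observation for an OLS fit with intercept."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])          # design matrix with intercept
    p = A.shape[1]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # OLS coefficients
    resid = y - A @ beta
    # Leverages = diagonal of the hat matrix, computed via a thin QR of A
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q**2, axis=1)
    s2 = resid @ resid / (n - p)                  # residual variance estimate
    return resid**2 / (p * s2) * h / (1.0 - h) ** 2

def fit_without_flagged(X, y, cutoff=None):
    """Flag points with large Cook's distance, drop them, and refit OLS."""
    n = X.shape[0]
    dist = cooks_distance(X, y)
    if cutoff is None:
        cutoff = 4.0 / n                          # rule-of-thumb cutoff (assumption)
    keep = dist <= cutoff
    A = np.column_stack([np.ones(keep.sum()), X[keep]])
    beta, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
    return beta, ~keep

# Synthetic example (illustrative data, not from the question)
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_beta = np.array([1.0, 2.0, -1.0, 0.5])       # intercept followed by slopes
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.5, size=n)
X[:5] += 6.0                                      # contaminate covariates
y[:5] -= 20.0                                     # contaminate responses

dist = cooks_distance(X, y)
beta_hat, flagged = fit_without_flagged(X, y)
```

Single-case diagnostics such as Cook's distance are computed from the contaminated fit itself, so a cluster of outliers can partly mask one another. The sketch shows the workflow only and should not be read as a recommended detector.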
