Identifying outliers for non linear regression

Question

I am doing research in the field of functional response of mites. I would like to do a regression to estimate the parameters (attack rate and handling time) of the Rogers type II function. I have a dataset of measurements. How can I best determine outliers?

For my regression I use the following script in R (a nonlinear regression). The dataset is a simple two-column text file called data.txt with N0 values (number of initial prey) and FR values (number of prey eaten during 24 hours):

library("nlstools")
library("emdbook")  # lambertW() is not in base R or nlstools; emdbook provides one
dat <- read.delim("C:/data.txt")
# Rogers type II model
a <- c(0,50)  # x-axis limits
b <- c(0,40)  # y-axis limits
plot(FR~N0,data=dat,main="Rogers II normaal",xlim=a,ylim=b,xlab="N0",ylab="FR")
rogers.predII <- function(N0,a,h,T) {N0 - lambertW(a*h*N0*exp(-a*(T-h*N0)))/(a*h)}
params1 <- list(attackR3_N=0.04,Th3_N=1.46)
RogersII_N <- nls(FR~rogers.predII(N0,attackR3_N,Th3_N,T=24),start=params1,data=dat,control=list(maxiter=10000))
hatRIIN <- predict(RogersII_N)
lines(spline(dat$N0,hatRIIN))
summary(RogersII_N)$parameters

For plotting the classic residual graphs I use the following script:

res <- nlsResiduals(RogersII_N)
plot(res, type = 0)
hist(res$resi1,main="histogram residuals")
qqnorm(res$resi1,main="QQ residuals")
hist(res$resi2,main="histogram normalised residuals")
qqnorm(res$resi2,main="QQ normalised residuals")
par(mfrow=c(1,1))
boxplot(res$resi1,main="boxplot residuals")
boxplot(res$resi2,main="boxplot normalised residuals")

Questions

How can I best determine which data points are outliers?
Are there tests I can use in R which are objective and show me which data points are outliers?

score 9 · Answer 1 · edited Apr 13 '17 at 12:44

Several tests for outliers, including Dixon's and Grubbs', are available in the outliers package in R. For a list of the tests, see the documentation for the package. References describing the tests are given on the help pages for the corresponding functions.
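
To make concrete what such a test computes, here is a minimal base-R sketch of the two-sided Grubbs' test for a single outlier. The helper name grubbs_flag and the example vector are made up for illustration; in practice the outliers package's grubbs.test() would be the usual route. The test assumes the values are roughly normal, and when it is applied to residuals of a nonlinear fit, keep in mind that the suspicious point also took part in fitting the curve.

```r
# Two-sided Grubbs' test for a single outlier, base R only.
# grubbs_flag() is a hypothetical helper, not part of any package.
grubbs_flag <- function(x, alpha = 0.05) {
  n <- length(x)
  stopifnot(n >= 3)
  dev <- abs(x - mean(x))
  i <- which.max(dev)                 # most extreme observation
  G <- dev[i] / sd(x)                 # Grubbs statistic
  # Critical value derived from the t distribution with n - 2 degrees of freedom
  t_crit <- qt(1 - alpha / (2 * n), df = n - 2)
  G_crit <- (n - 1) / sqrt(n) * sqrt(t_crit^2 / (n - 2 + t_crit^2))
  list(index = i, G = G, G_crit = G_crit, suspicious = G > G_crit)
}

# Made-up residuals; the last value is deliberately extreme.
res_example <- c(-0.8, 0.3, 1.1, -0.5, 0.2, -1.2, 0.6, 0.1, -0.4, 9.0)
out <- grubbs_flag(res_example)
print(out)
```

The function only marks the single most extreme value as suspicious or not; whether to drop it remains a separate decision.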

In case you were planning to remove the outliers from your data, bear in mind that this isn't always advisable. See for instance this question for a discussion on this (as well as some more suggestions on how to detect outliers).

score 8 · Answer 2 · answered Jun 07 '12 at 18:10

Neither am I a statistician. Therefore I use my expert knowledge about the data to find outliers. I.e. I look for physical/biological/whatever reasons that made some measurements different from the others.

In my case that is e.g.

cosmic rays messing up part of the measured signal
someone entering the lab, switching on the light
just the whole spectrum somehow looks different
the first measurement series was taken during normal work hours and is an order of magnitude more noisy than the 10 pm series

Surely you could tell us similar effects.

Note that my 3rd point is different from the others: I don't know what happened. This may be the kind of outlier you're asking about. However, without knowing what caused it (and that this cause invalidates the data point) it is difficult to say that it shouldn't appear in the data set. Also: your outlier may be my most interesting sample...

Therefore, I often do not speak of outliers, but of suspicious data points. This reminds everyone that they need to be double checked for their meaning.

Whether it is good or not to exclude data (who wants to find outliers just for the sake of having them?) depends very much on what the task at hand is and what the "boundary conditions" for that task are. Some examples:

you just discovered the new outlierensis Joachimii subspecies ;-) no reason to exclude them. Exclude all others.
you want to predict preying times of mites. If it is acceptable to restrict the prediction to certain conditions, you could formulate these and exclude all other samples and say your predictive model deals with this or that situation, though you already know other situations (describe outlier here) do occur.
Keep in mind that excluding data with the help of model diagnostics can create a kind of a self-fulfilling prophecy or an overoptimistic bias (i.e. if you claim your method is generally applicable): the more samples you exclude because they don't fit your assumptions, the better are the assumptions met by the remaining samples. But that's only because of exclusion.
I currently have a task at hand where I have a bunch of bad measurements (I know the physical reason why I consider the measurement bad), and a few more that somehow "look weird". What I do is exclude these samples from training of a (predictive) model, but separately test the model with them, so I can say something about the robustness of my model against outliers of those types which I know will occur every once in a while. Thus, the application somehow or other needs to deal with these outliers.
Yet another way to look at outliers is asking: "How much do they influence my model?" (Leverage). From this point of view you can measure robustness or stability with respect to weird training samples.
Whatever statistical procedure you use, it will either not identify any outliers, or also have false positives. You can characterize an outlier testing procedure like other diagnostic tests: it has a sensitivity and a specificity, and - more important for you - they correspond (via the outlier proportion in your data) to a positive and negative predictive value. In other words, particularly if your data has very few outliers, the probability that a case identified by the outlier test really is an outlier (i.e. shouldn't be in the data) can be very low.
I believe that expert knowledge about the data at hand is usually much better at detecting outliers than statistical tests: the test is just as good as the assumptions behind it. And one-size-fits-all is often not really good for data analysis. At least I frequently deal with a kind of outlier where experts (about that type of measurement) have no problem identifying the exact part of the signal that is compromised, while automated procedures often fail (it is easy to get them to detect that there is a problem, but very difficult to get them to find where the problem begins and where it ends).
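
The point about predictive values can be checked with a few lines of arithmetic. The sketch below applies Bayes' rule to one hypothetical outlier test; the sensitivity, specificity and outlier proportions are illustrative assumptions, not properties of any real procedure.

```r
# Positive predictive value of an outlier test via Bayes' rule.
# All numbers below are illustrative assumptions, not measured properties
# of any particular test.
ppv <- function(sensitivity, specificity, prevalence) {
  true_pos  <- sensitivity * prevalence
  false_pos <- (1 - specificity) * (1 - prevalence)
  true_pos / (true_pos + false_pos)
}

# Same hypothetical test, different proportions of real outliers in the data.
prevalence <- c(0.001, 0.01, 0.05, 0.20)
result <- data.frame(
  prevalence = prevalence,
  ppv = ppv(sensitivity = 0.90, specificity = 0.95, prevalence = prevalence)
)
print(result)
```

With these assumed numbers, if only 1% of the data are true outliers, roughly 15% of the flagged points are real outliers; the rest are false alarms.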

There's a lot of good information here. I especially like bullet points #4 & 5. — gung - Reinstate Monica, Jun 08 '12 at 18:47

score 4 · Answer 3 · edited Jun 07 '12 at 16:21

For univariate outliers there are Dixon's ratio test and Grubbs' test, both assuming normality. To test for an outlier you have to assume a population distribution, because you are trying to show that the observed value is too extreme or unusual to have come from the assumed distribution. I have a paper in the American Statistician in 1982, which I may have referenced here before, showing that Dixon's ratio test can be used in small samples even for some non-normal distributions: Chernick, M.R. (1982), "A Note on the Robustness of Dixon's Ratio in Small Samples", American Statistician, p. 140. For multivariate outliers and outliers in time series, influence functions for parameter estimates are useful measures for detecting outliers informally (I do not know of formal tests constructed for them, although such tests are possible). Look at Barnett and Lewis' text "Outliers in Statistical Data" for a detailed treatment of outlier detection methods.
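
An informal, empirical stand-in for the influence idea is to refit the model with each observation left out and see how far the parameter estimates move. The sketch below does that with nls(); the synthetic data, the planted aberrant point and the use of the simpler Holling disc equation instead of the Rogers model are all assumptions made to keep the example self-contained.

```r
# Leave-one-out influence of each observation on nls() parameter estimates.
# Synthetic data and the Holling disc equation are assumptions for illustration.
set.seed(1)
T_total <- 24
holling2 <- function(N0, a, h) a * N0 * T_total / (1 + a * h * N0)

N0 <- rep(c(2, 4, 8, 16, 32, 48), each = 4)
FR <- holling2(N0, a = 0.04, h = 1.5) + rnorm(length(N0), sd = 0.5)
FR[24] <- FR[24] + 6                      # plant one aberrant point
dat <- data.frame(N0 = N0, FR = FR)

fit_all <- nls(FR ~ holling2(N0, a, h), data = dat,
               start = list(a = 0.05, h = 1))

# Refit without observation i and record the change in each estimate.
delta <- t(sapply(seq_len(nrow(dat)), function(i) {
  fit_i <- nls(FR ~ holling2(N0, a, h), data = dat[-i, ],
               start = as.list(coef(fit_all)))
  coef(fit_all) - coef(fit_i)
}))

# Scale by the full-fit standard errors so both parameters are comparable.
se <- summary(fit_all)$coefficients[, "Std. Error"]
influence <- sweep(abs(delta), 2, se, "/")
most_influential <- which.max(rowSums(influence))
print(round(influence[most_influential, , drop = FALSE], 2))
print(most_influential)
```

This is a diagnostic, not a test: it ranks points by how much they move the estimates and leaves the judgement about them to you.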

score 3 · Answer 4 · edited Jun 11 '20 at 14:32

See http://www.waset.org/journals/waset/v36/v36-45.pdf, "On the outlier Detection in Nonlinear Regression" [sic].

Abstract

The detection of outliers is very essential because of their responsibility for producing huge interpretative problem in linear as well as in nonlinear regression analysis. Much work has been accomplished on the identification of outlier in linear regression, but not in nonlinear regression. In this article we propose several outlier detection techniques for nonlinear regression. The main idea is to use the linear approximation of a nonlinear model and consider the gradient as the design matrix. Subsequently, the detection techniques are formulated. Six detection measures are developed that combined with three estimation techniques such as the Least-Squares, M and MM-estimators. The study shows that among the six measures, only the studentized residual and Cook Distance which combined with the MM estimator, consistently capable of identifying the correct outliers.
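
The main idea in the abstract, treating the gradient of the fitted model as the design matrix, can be sketched in base R. The example below uses an ordinary least-squares nls() fit (not the MM-estimator the paper favours), synthetic data, and the Holling disc equation as a stand-in model; those choices are assumptions for illustration only.

```r
# Linear-approximation diagnostics for an nls() fit: the gradient matrix of the
# fitted model plays the role of the design matrix. Ordinary least squares is
# used here, not the MM-estimator the paper recommends.
set.seed(2)
holling2 <- function(N0, a, h) a * N0 * 24 / (1 + a * h * N0)  # stand-in model
N0 <- rep(c(2, 4, 8, 16, 32, 48), each = 4)
FR <- holling2(N0, a = 0.04, h = 1.5) + rnorm(length(N0), sd = 0.5)
FR[10] <- FR[10] + 5                       # planted aberrant observation
dat <- data.frame(N0 = N0, FR = FR)

fit <- nls(FR ~ holling2(N0, a, h), data = dat, start = list(a = 0.05, h = 1))

V <- fit$m$gradient()                      # n x p matrix of d(fitted)/d(parameter)
e <- as.numeric(residuals(fit))
n <- nrow(V); p <- ncol(V)
h_ii <- rowSums((V %*% solve(crossprod(V))) * V)   # leverages (hat-matrix diagonal)
s <- sqrt(sum(e^2) / (n - p))                      # residual standard error

stud_res <- e / (s * sqrt(1 - h_ii))               # internally studentized residuals
cooks_d  <- stud_res^2 * h_ii / (p * (1 - h_ii))   # Cook's distance

diagnostics <- data.frame(N0 = dat$N0, stud_res = round(stud_res, 2),
                          cooks_d = round(cooks_d, 3))
print(diagnostics[order(-abs(stud_res))[1:3], ])
```

Because the fit here is plain least squares, a gross outlier also inflates the residual standard error and pulls the curve, which is the weakness the paper's robust estimators are meant to address.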

+1 Despite the obvious problems with English (and in the mathematical typesetting), this paper appears to be a useful contribution to the question. — whuber, Oct 05 '12 at 12:23

Harvey Motulsky · Answer 5 · 2012-10-05T13:37:25.647

An outlier is a point that is "too far" from "some baseline". The trick is to define both those phrases! With nonlinear regression, one can't just use univariate methods to see if an outlier is "too far" from the best-fit curve, because the outlier can have an enormous influence on the curve itself.

Ron Brown and I developed a unique method (which we call ROUT -- Robust regression and Outlier removal) for detecting outliers with nonlinear regression, without letting the outlier affect the curve too much. First, fit the data with a robust regression method where outliers have little influence. That forms the baseline. Then use the ideas of the False Discovery Rate (FDR) to define when a point is "too far" from that baseline, and so is an outlier. Finally, remove the identified outliers and fit the remaining points conventionally.

The method is published in an open access journal: Motulsky HJ and Brown RE, Detecting outliers when fitting data with nonlinear regression – a new method based on robust nonlinear regression and the false discovery rate, BMC Bioinformatics 2006, 7:123. Here is the abstract:

Background. Nonlinear regression, like linear regression, assumes that the scatter of data around the ideal curve follows a Gaussian or normal distribution. This assumption leads to the familiar goal of regression: to minimize the sum of the squares of the vertical or Y-value distances between the points and the curve. Outliers can dominate the sum-of-the-squares calculation, and lead to misleading results. However, we know of no practical method for routinely identifying outliers when fitting curves with nonlinear regression.

Results. We describe a new method for identifying outliers when fitting data with nonlinear regression. We first fit the data using a robust form of nonlinear regression, based on the assumption that scatter follows a Lorentzian distribution. We devised a new adaptive method that gradually becomes more robust as the method proceeds. To define outliers, we adapted the false discovery rate approach to handling multiple comparisons. We then remove the outliers, and analyze the data using ordinary least-squares regression. Because the method combines robust regression and outlier removal, we call it the ROUT method.

When analyzing simulated data, where all scatter is Gaussian, our method detects (falsely) one or more outlier in only about 1–3% of experiments. When analyzing data contaminated with one or several outliers, the ROUT method performs well at outlier identification, with an average False Discovery Rate less than 1%.

Conclusion. Our method, which combines a new method of robust nonlinear regression with a new method of outlier identification, identifies outliers from nonlinear curve fits with reasonable power and few false positives.

It has not (as far as I know) been implemented in R, but we implemented it in GraphPad Prism and provide a simple explanation in the Prism help.
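
For readers who want to see the three-step shape of the idea in R, here is a much simplified sketch. It is not the published ROUT algorithm: the fixed Lorentzian scale, the MAD-based scatter estimate, the normal p-values and the Benjamini-Hochberg adjustment via p.adjust() are swapped-in simplifications, and the data and the Holling disc model are synthetic assumptions.

```r
# Simplified robust-fit-then-FDR sketch (NOT the published ROUT algorithm).
set.seed(3)
holling2 <- function(N0, a, h) a * N0 * 24 / (1 + a * h * N0)  # stand-in model
N0 <- rep(c(2, 4, 8, 16, 32, 48), each = 4)
FR <- holling2(N0, a = 0.04, h = 1.5) + rnorm(length(N0), sd = 0.5)
FR[c(7, 22)] <- FR[c(7, 22)] + c(6, -7)    # two planted outliers
dat <- data.frame(N0 = N0, FR = FR)

# Step 1: robust baseline. Minimise a Lorentzian-type loss, which grows only
# logarithmically for large residuals, so outliers have little pull.
# Parameters are optimised on the log scale to keep them positive.
lorentz_scale <- 1   # assumed rough size of ordinary scatter, in prey eaten
lorentz_loss <- function(log_par) {
  par <- exp(log_par)
  r <- dat$FR - holling2(dat$N0, par[1], par[2])
  sum(log(1 + (r / lorentz_scale)^2))
}
robust <- optim(log(c(0.05, 1)), lorentz_loss)
rob_par <- exp(robust$par)
r_rob <- dat$FR - holling2(dat$N0, rob_par[1], rob_par[2])

# Step 2: robust scatter estimate (MAD), two-sided normal p-values for each
# residual, then Benjamini-Hochberg adjustment with Q = 1%.
sigma_rob <- mad(r_rob)
p_val <- 2 * pnorm(-abs(r_rob) / sigma_rob)
flagged <- which(p.adjust(p_val, method = "BH") < 0.01)

# Step 3: conventional least-squares fit on the remaining points
# (all points are kept if nothing was flagged).
kept <- if (length(flagged) > 0) dat[-flagged, ] else dat
final_fit <- nls(FR ~ holling2(N0, a, h), data = kept,
                 start = list(a = 0.05, h = 1))
print(flagged)
print(coef(final_fit))
```

The order of operations is the important part: the baseline curve is estimated in a way that the outliers barely affect, and only then is "too far" decided.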

score 0 · Answer 6 · answered Oct 04 '12 at 19:26

Your question is too general. There is no single best method to exclude the "outliers".

You have to know some properties of the "outliers"; otherwise you cannot know which method is best. After deciding which method to use, you need to calibrate its parameters carefully.
