
I'm trying to determine whether different linear regressions are statistically different from one another. For that, I fitted a regression for each of my treatments (round, triangle and square in the figure) and took the slope value (a) and its error. I then ran t-tests to see whether the treatments are significantly different, using this set of values for each treatment: (a-error, a, a+error).

My problem is that, even though the linear regression is significant (p-value < 0.05), I can't technically use it, as it is a parametric method and the assumptions on my residuals are not met. For each of my treatments I performed the Durbin-Watson test, the Goldfeld-Quandt test (for linear regressions) and the Shapiro-Wilk test, and I got p-values < 0.05 for some of them, sometimes all of them (depending on the treatment). I've tried all the transformations I could think of (log10, square root, double square root, square, inverse), and the problem remains. There don't seem to be any outliers in my dataset either.

So here are my specific questions:

  1. Is there another way I can make this work with linear regression? Can I still use that function (my dataset is small, which might explain why the assumptions on the residuals are not met)?

  2. What would be a good non-parametric alternative? I've been exploring the Spearman correlation, but even with the "conf.level = 0.95" argument I don't get an error estimate, so I can't perform the t-tests. I'm glad to know there is a strong correlation between my quantitative variables, but that's something I already know. Is there a way I can compare Spearman correlations?

  3. Is there a third option? I'm currently trying Poisson regression, as my data are not continuous (there are clusters along the regressions, see figure), but I get a lot of warnings when I apply it to my data and I'm not sure why.

I'm a beginner with both R and statistics, so thank you for your understanding and thank you for your help :)

[Figure: scatterplot of the data for the three treatments (round, triangle, square) with their fitted regression lines]

  • Have you tried a log or square-root transformation of your dependent variable? It looks like the error increases with the magnitude of the observation, and such transformations can bring the residual distribution closer to normal. Or you could try ordinal regression, which doesn't require assumptions about the error distribution. See for example the orm() function in the rms package in R. – EdM Apr 04 '22 at 18:14
  • Thank you so much for your help! I have tried transforming only the y variable and it doesn't really help. Normality is not necessarily the problem, as the residuals are not centered either and the residual variance is not always constant. I'm in the process of trying out this orm() function and it seems promising. But the coefficients obtained are very different from the slopes I get with linear regression. Are they another type of coefficient, or am I doing something wrong? – Océane Hourtané Apr 05 '22 at 21:16
  • I’m troubled by your statement that the residuals aren’t centered in ordinary least-squares regression. Does that mean that the average residual isn’t 0? That won’t happen if you include an intercept in your model. Are you doing this regression without an intercept? That’s almost never a good idea. See this page, for example. – EdM Apr 07 '22 at 13:36
  • I did not include an intercept because this is work in ecotoxicology and our measured concentrations for controls are never exactly 0. So there is a 0 that is not really a 0, if that makes any sense. I feel uncomfortable forcing the intercept. – Océane Hourtané Apr 20 '22 at 20:05
  • When you don’t include an intercept you are forcing the intercept to be exactly 0. Letting the model choose the intercept is the best way to handle that uncertainty in measurements. If you force the intercept to be 0 then you will not generally get residuals with a mean of 0. – EdM Apr 20 '22 at 20:16

1 Answer


The error in the y values increases with y. Here are some simulated data in R that resemble what's shown.

library(ggplot2)
library(rms)
addPropNoise <- function(x,prop) {x + rnorm(1,0,prop*x)} ## function to add proportional noise
addPropNoise <- Vectorize(addPropNoise) ## vectorize so it applies element-wise
x <- rep(c(10,25,50,100,200,300),6) ## six replicates at each of six x values
x <- x[order(x)] ## sort in increasing order
set.seed(136)
x <- addPropNoise(x,1/30)
y <- 0.5 * x
y <- addPropNoise(y,1/10)
df1 <- data.frame(x=x,y=y)
ggplot(data=df1,mapping=aes(x=x,y=y)) + geom_point() + geom_smooth(method=lm,se=FALSE) ## similar to plot of OP

[Figure: standard linear regression fit to the simulated data]

A comment indicated that the mean residual might not have been 0. That can happen if you omit the intercept from the model:

noIntFit <- lm(y~x-1,df1)
mean(noIntFit$residuals)
## [1] -0.09983617

but not otherwise:

lmFit <- lm(y~x,df1)
mean(lmFit$residuals)
## [1] -8.177294e-17

In this case, a log-log fit works nicely, as we know that the standard deviation in the y values is proportional to their values and that y is proportional to x.

ggplot(data=df1,mapping=aes(x=log(x),y=log(y))) + geom_point() + geom_smooth(method=lm,se=FALSE) ## log-log fit, designed to work

[Figure: log-log fit to data with increasing variance at high y]
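If you want the fitted model itself rather than just the plot, here is a minimal sketch (not part of the original code) of the log-log fit plus a normality check on its residuals:

logLogFit <- lm(log(y)~log(x),df1) ## slope should be near 1 and intercept near log(0.5), by construction of the simulation
summary(logLogFit)
shapiro.test(residuals(logLogFit)) ## Shapiro-Wilk test on the residuals of the log-log fit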

As we know that the variance in y increases with the square of y, we can try a weighted linear-regression model:

weightFit <- lm(y~x,df1,weights=(1/y^2))

The q-q plot from the weighted model (not shown) is almost perfect.
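If you want to inspect that q-q plot yourself, base R's diagnostic plots will produce it from the fitted object (a quick sketch):

plot(weightFit,which=2) ## normal q-q plot of the standardized residuals from the weighted fit
plot(lmFit,which=2) ## same plot for the unweighted fit, for comparison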

The problem with log-transforming only the y values is that you lose the specified linear association between y and x. If you expect that a log transform will help with stabilizing the error variance and you don't know the underlying relationship between y and x, you can use a restricted cubic spline to let the data tell you:

dd1 <- datadist(df1)
options(datadist="dd1")
rcsLogFit <- ols(log(y)~rcs(x),data=df1)
ggplot(Predict(rcsLogFit)) + geom_point(data=df1,mapping=aes(x=x,y=log(y))) +geom_function(fun=function(x) {log(0.5)+log(x)},color="blue")

The spline (black) almost perfectly matches the theoretical relationship (blue).

[Figure: fit of log(y) against a regression spline for untransformed x]

Remember that with this prior log transformation you are modeling the mean of log(y) rather than the log of the mean of y.
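To see that distinction concretely on the simulated data, here is a quick illustrative check (a sketch added for this point, not part of the original answer):

log(mean(df1$y)) ## log of the mean of y
mean(log(df1$y)) ## mean of log(y), which is what the log-scale model estimates
exp(mean(log(df1$y))) ## back-transforming recovers the geometric mean, not the arithmetic mean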

This example was designed to work well with a parametric model. You could consider quantile or ordinal regression if parametric models don't work adequately.
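For reference, hedged sketches of both alternatives on the simulated data, assuming the quantreg package is installed (rms is already loaded above); check each package's documentation for details:

library(quantreg)
medianFit <- rq(y~x,tau=0.5,data=df1) ## quantile regression for the median of y
summary(medianFit)
ordFit <- orm(y~x,data=df1) ## ordinal regression with the rms package's orm()
ordFit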

EdM