I would like to start by saying that I am new to statistics, so I still have a lot to learn. I apologise if something sounds trivial.

The problem

We have a sample of 30 nitrogen and carbon isotope observations, each taken from hard parts from a fish population. Each sample undergoes a treatment with two phases, and in each phase a substance is applied. So:

  1. A is the sample of raw isotope values
  2. B is the sample of the isotope values after treatment 1 has been applied
  3. C is the sample of the isotope values after treatments 1 and 2 have been applied.

We want to know whether we can take the isotope values in C, after both treatments have been applied, and infer back what the raw, untreated isotope values were. For that we propose a simple linear regression line as the estimator.
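To make the idea concrete, here is a minimal sketch of the back-prediction step, using only the Python standard library. The data are made-up stand-ins, not our isotope measurements, and the names `a_raw`, `c_treated` and `predict_raw` are hypothetical. It assumes we regress A on C, since A is the quantity we want to recover.

```python
import random
from statistics import fmean

random.seed(0)

# Made-up stand-ins for the 30 paired measurements (NOT the real isotope data).
a_raw = [random.uniform(8.0, 14.0) for _ in range(30)]              # A: raw values
c_treated = [0.9 * a + 0.5 + random.gauss(0, 0.2) for a in a_raw]   # C: after both treatments

# Ordinary least squares for A = intercept + slope * C.
c_bar, a_bar = fmean(c_treated), fmean(a_raw)
slope = (sum((c - c_bar) * (a - a_bar) for c, a in zip(c_treated, a_raw))
         / sum((c - c_bar) ** 2 for c in c_treated))
intercept = a_bar - slope * c_bar


def predict_raw(c_value):
    """Estimate the raw isotope value from a treated value (hypothetical helper)."""
    return intercept + slope * c_value


print(f"A_hat = {intercept:.3f} + {slope:.3f} * C")
print(f"Predicted raw value for C = 10.0: {predict_raw(10.0):.3f}")
```

Predictions are only trustworthy inside the range of C values that were used to fit the line.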

Sample's distribution type

Using a Shapiro-Wilk normality test we have determined that A is not normally distributed (p-value = 0.0008 and 0.02 for nitrogen and carbon respectively).

Regression lines performed

When we run the linear models, the p-values for the slope coefficients are highly significant for both elements (~10^-12). The intercepts are not significant, but this may have an explanation based on the chemistry of the substance applied. R-squared is ~0.88 for both, indicating a strong linear relationship between A and B. Between A and C, R-squared is ~0.85.

My Questions

a) For A, for both N and C, the Shapiro-Wilk test indicated that the values may not be normally distributed. However, the response variables (B and C) "behave" differently for carbon and nitrogen. For the former the treated sample does follow a normal distribution, but not for nitrogen. Reading "What if residuals are normally distributed, but y is not?", I see that this doesn't matter: a normally distributed response seems to matter only from an optimality perspective, and the response could even be non-normal, apparently. Is this your conclusion too, please?

b) I was thinking about fitting another model, comparing it to the simple linear regression, and using AIC to "prove" that linear regression performs better as a predictor. This derives from the fact that our original samples are not normally distributed. For example, I could use a GLM, but I do not know which distribution fits A best, and a GLM can only be used with exponential-family distributions. Is this step necessary in order to show that linear regression performs better than the alternatives? Would you suggest another model to try, please?
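As an illustration of what such an AIC comparison could look like, here is a minimal sketch on made-up data (not our isotope values). The second candidate, a line forced through the origin, is only an assumption chosen because our intercepts were not significant; the names `gaussian_aic`, `a_raw` and `c_treated` are hypothetical. Both candidates are fitted to the same response, which an AIC comparison requires.

```python
import math
import random
from statistics import fmean

random.seed(1)

# Made-up stand-ins for the 30 paired measurements (NOT the real isotope data).
c_treated = [random.uniform(8.0, 14.0) for _ in range(30)]
a_raw = [1.05 * c + 0.1 + random.gauss(0, 0.2) for c in c_treated]
n = len(a_raw)


def gaussian_aic(rss, n_obs, n_coef):
    """AIC of a least-squares fit with normal errors.

    The +1 counts the error variance as an estimated parameter.
    """
    log_lik = -0.5 * n_obs * (math.log(2 * math.pi * rss / n_obs) + 1)
    return 2 * (n_coef + 1) - 2 * log_lik


# Candidate 1: A = b0 + b1 * C
c_bar, a_bar = fmean(c_treated), fmean(a_raw)
b1 = (sum((c - c_bar) * (a - a_bar) for c, a in zip(c_treated, a_raw))
      / sum((c - c_bar) ** 2 for c in c_treated))
b0 = a_bar - b1 * c_bar
rss_line = sum((a - (b0 + b1 * c)) ** 2 for c, a in zip(c_treated, a_raw))

# Candidate 2: A = b * C (no intercept)
b = sum(c * a for c, a in zip(c_treated, a_raw)) / sum(c * c for c in c_treated)
rss_origin = sum((a - b * c) ** 2 for c, a in zip(c_treated, a_raw))

aic_line = gaussian_aic(rss_line, n, n_coef=2)
aic_origin = gaussian_aic(rss_origin, n, n_coef=1)
print(f"AIC with intercept:    {aic_line:.2f}")
print(f"AIC without intercept: {aic_origin:.2f}")
```

The lower AIC is preferred, and differences of less than about 2 are usually read as weak evidence either way.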

c) Should I check whether the residuals are normally distributed, which is, I believe, what matters most here? The residual standard error is about 0.2.
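For reference, a minimal sketch of a residual check on made-up data (not our isotope values), using only the Python standard library. Note that it does not run a Shapiro-Wilk test: it swaps in the correlation between the ordered residuals and normal quantiles (the numerical counterpart of a Q-Q plot), where values close to 1 are consistent with normal residuals. The plotting positions `(i - 0.375) / (n + 0.25)` are one common convention, and all variable names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(2)

# Made-up stand-ins for the 30 paired measurements (NOT the real isotope data).
c_treated = [random.uniform(8.0, 14.0) for _ in range(30)]
a_raw = [1.05 * c + 0.1 + random.gauss(0, 0.2) for c in c_treated]
n = len(a_raw)

# Ordinary least squares for A = intercept + slope * C.
c_bar, a_bar = fmean(c_treated), fmean(a_raw)
slope = (sum((c - c_bar) * (a - a_bar) for c, a in zip(c_treated, a_raw))
         / sum((c - c_bar) ** 2 for c in c_treated))
intercept = a_bar - slope * c_bar
residuals = [a - (intercept + slope * c) for c, a in zip(c_treated, a_raw)]

# Residual standard error: n - 2 because two coefficients were estimated.
rse = math.sqrt(sum(r * r for r in residuals) / (n - 2))

# Normal quantiles to pair with the sorted residuals.
std_normal = NormalDist()
theoretical = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
ordered = sorted(residuals)


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    u_bar, v_bar = fmean(u), fmean(v)
    num = sum((p - u_bar) * (q - v_bar) for p, q in zip(u, v))
    den = math.sqrt(sum((p - u_bar) ** 2 for p in u) * sum((q - v_bar) ** 2 for q in v))
    return num / den


qq_corr = pearson(theoretical, ordered)
print(f"Residual standard error: {rse:.3f}")
print(f"Q-Q correlation of residuals: {qq_corr:.3f}")
```

Plotting `ordered` against `theoretical` gives the usual Q-Q plot, which shows where any departure from normality sits.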

My problem is that, in general, I am not sure when enough is enough in evaluating whether the regression line is a good predictor for our case, beyond the statistics I showed here.

p2gonz
  • Normality only concerns errors/residuals. – Richard Hardy Apr 14 '22 at 17:04
  • Thanks. I actually did a Shapiro-Wilk test for normality of the residuals and it is not significant either, with p-values of 0.7 and 0.2 respectively, so my residuals aren't normally distributed either. – p2gonz Apr 14 '22 at 17:29
  • You are not interpreting the test results correctly. High p-values of the Shapiro-Wilk test suggest that the null hypothesis of normality cannot be rejected. – Richard Hardy Apr 14 '22 at 17:33
  • You're absolutely right. I read that the wrong way around. In fact, in my post above I do interpret it correctly, but not later on. Thanks a lot for that. – p2gonz Apr 14 '22 at 17:51
  • Consider semiparametric regression: https://hbiostat.org/bib/po – Frank Harrell Apr 14 '22 at 18:25
  • Thanks a lot, I will have a look at those. – p2gonz Apr 15 '22 at 10:02
