How do I go about rectifying violated assumptions when more than one is violated at the same time?

Question

I am currently trying to run a model analyzing the duration of the egg stage of each sex of two species of insect across five different temperatures. All independent variables are categorical. My model looks like this:

model1 <- lm(Duration.egg ~ Temperature * Species * Sex, data = egg.na.2)

In building this model I have tried to evaluate the assumptions of:

Normality of residuals
Homoscedasticity
No serial autocorrelation
No unduly influential observations (leverage above a particular threshold)

All of these assumptions appear to have been violated (see my diagnostic plots, autocorrelation function plot, and leverage plot below). Is there a best assumption to tackle first, or a generally accepted hierarchy for which violated assumptions should be rectified first?

Here are my diagnostic plots via plot(model1):

Here is my autocorrelation function plot:

Here is my plot of leverage for each data point. The threshold for flagging a data point as influential is set at 2p/n, where p = number of independent variables and n = sample size.
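
For reference, the 2p/n rule can be computed in base R roughly as follows. This is a minimal sketch under stated assumptions: it uses the built-in ToothGrowth data as a stand-in (not the egg-duration data), deliberately thins one cell to make the design unbalanced, and takes p to be the number of estimated coefficients including the intercept, which is the usual convention for this rule of thumb.

```r
# Stand-in data only: ToothGrowth ships with R and is NOT the egg-duration data.
# Dropping rows 4 to 10 leaves just three observations in one supp x dose cell.
tg <- transform(ToothGrowth, dose = factor(dose))[-(4:10), ]
fit <- lm(len ~ supp * dose, data = tg)

h <- hatvalues(fit)       # leverage of each observation
p <- length(coef(fit))    # estimated coefficients, intercept included
n <- nobs(fit)            # observations used in the fit
threshold <- 2 * p / n

flagged <- which(h > threshold)
table(tg$supp, tg$dose)   # cell sizes
round(h[flagged], 3)      # leverage of the flagged rows
```

When every predictor is categorical and all interactions are included, the leverage of an observation is simply 1 divided by the number of observations in its cell, so a "high leverage" flag in such a model mostly signals a small cell rather than an unusual predictor value.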

Edit 1 - Here is the pACF plot for my model

This model looks pretty close to OK to me. 1) Normality is violated, but your data seem to have lighter tails than the normal (actually, the data look bounded?), so not a problem. 2) What indicates heteroskedasticity to you? 3) The ACF indeed indicates serial correlation; can you show us the pACF plot so we can suggest a time series model? 4) There are indeed some big leverage points, but the associated residuals seem to be small (bottom-right plot of the original four); we should double-check by using out-of-sample/deleted residuals instead. Finally, can you expand on how time is involved? — John Madden, Jan 16 '23 at 15:28
Oh yes and to answer your question at a high level, often the remedies for different assumptions violations will be conceptually independent, and so it doesn't matter "in what order" you do them, as we saw here. — John Madden, Jan 16 '23 at 15:30
@John Madden 1) I apologise if this is an obvious question but what do you mean by bounded data?, 2) In plot 3 (bottom left) there appears to be a pattern, 3) I have attached the pACF plot, 4) What do you mean by out of sample/ deleted residuals?, 5) There are currently no time-related variables in the model as the dependent variable was only measured once. — Insect_biologist, Jan 16 '23 at 15:56
"Time" is meant in a general sense of determining a sequence of observations that are expected to have properties akin to those of an approximately stationary time series. Otherwise, your use of a pacf is meaningless. — whuber, Jan 16 '23 at 17:10
@whuber Would you be able to explain this a little bit more please as I am not sure that I understand. Why would the concept of a time series be relevant to these analyses? Is it because not all of the data points would have been collected at the same time due to the independent variables influencing when the egg stage would end? — Insect_biologist, Jan 16 '23 at 17:30
I am saying the concept is not relevant unless it is! By exhibiting a pacf you have implicitly told us you think your data have a structure that is enough like a stationary time series that examining its autocorrelation function might be relevant. It doesn't matter whether the "time" index is time itself or any other variable. If this is a mystery to you, then why look at the pacf in the first place? — whuber, Jan 16 '23 at 17:41
@Insect_biologist 1) no, please excuse my use of jargon: by bounded I meant that there is a maximum and minimum possible duration that cannot be exceeded. 2) my eyes don't see a pattern there, FWIW 3/5) ACF plots are appropriate only for time-valued data: how did you create this ACF? i.e. what unit does the "Lag" on the x-axis take? If there is no time in your data, ACF plots are not meaningful as whuber mentioned. (this is good news for you in terms of simplifying the analysis :)) 4) in order to accurately gauge influence, we need to see how the fit changes if the observation is included... — John Madden, Jan 16 '23 at 17:42
... vs not included; this is what is meant by "deleted" (or "out of sample" in machine learning parlance). It seems one way to accomplish this in R is via this function: https://search.r-project.org/CRAN/refmans/olsrr/html/ols_plot_resid_stud_fit.html in the package olsrr (I haven't personally used this function before, but it seems straightforward enough) — John Madden, Jan 16 '23 at 17:43
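
The "included vs not included" idea can also be sketched in base R without olsrr. This is an illustration only: it uses the built-in ToothGrowth data as a stand-in for the egg-duration data, and the names `fit_minus_i`, `deleted_i` and `shortcut_i` are made up for this example. It first computes one deleted residual by refitting, then shows the built-in functions that do the same for every row.

```r
# Stand-in data only: ToothGrowth ships with R and is NOT the egg-duration data.
tg <- transform(ToothGrowth, dose = factor(dose))
fit <- lm(len ~ supp * dose, data = tg)

# Deleted residual for one observation, computed the slow way:
# refit without row i, then predict row i from that fit.
i <- 1
fit_minus_i <- lm(len ~ supp * dose, data = tg[-i, ])
deleted_i <- tg$len[i] - predict(fit_minus_i, newdata = tg[i, ])

# The same quantity without refitting: e_i / (1 - h_i).
shortcut_i <- resid(fit)[i] / (1 - hatvalues(fit)[i])
all.equal(unname(deleted_i), unname(shortcut_i))

# Base R applies this idea to every row at once.
r_deleted <- rstudent(fit)        # externally studentized (deleted) residuals
cook <- cooks.distance(fit)       # combines residual size and leverage

plot(fitted(fit), r_deleted,
     xlab = "Fitted value", ylab = "Deleted studentized residual")
abline(h = c(-2, 0, 2), lty = 2)  # rough visual guide, not a formal cutoff
head(sort(cook, decreasing = TRUE))
```

A point is worth a closer look when it combines high leverage with a large deleted residual; high leverage with a small deleted residual, as described in the comment above, usually changes the fit very little.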
PS @Insect_biologist did you see that praying mantis eat that bird? Pretty cool right? https://www.economist.com/science-and-technology/2023/01/11/a-praying-mantis-attacks-a-nestling — John Madden, Jan 16 '23 at 17:45
@whuber That makes sense. I was following a guide on testing model assumptions and I think I may have misinterpreted whether or not testing for autocorrelation is relevant to my data analyses. — Insect_biologist, Jan 16 '23 at 19:56
Sometimes autocorrelation appears in surprising ways. For instance, I once found it was important in an environmental soil sampling dataset because the locations were sampled in a zigzag pattern and the measurements were spatially correlated. Sometimes, then, people sort their data by the order in which they were collected (or according to some proxy for that order) and test for serial correlation. The standard approach is called the Durbin-Watson test. But when your data are in a meaningless order, such a test would usually be superfluous. A pacf is not a good general-purpose substitute. — whuber, Jan 16 '23 at 20:01
@JohnMadden 1) There is a minimum possible duration in that it cannot be less than 6hrs as individuals were checked for hatching every 12hrs (so time of hatching was taken as the mid-point between observation times). There wasn't a defined maximum though other than the fact that, amongst the latest hatchers, they would all have the same hatching time regardless of when they hatched within that same 12hrs. I apologize if I am confusing things. 2) So would we only reject the assumption of homoscedasticity in cases where there was a particularly strong trend in plot 1 (upper left)? — Insect_biologist, Jan 16 '23 at 20:07
@JohnMadden 3/5) I think I have come to the realization that this is perhaps not relevant to my analyses. I used the code sresid<-residuals(model1, type = "pearson") to produce standardized residuals then acf(sresid, main = "Auto-correlation plot") to produce the ACF plot as this was provided in my R handbook. 4) I will try the function in the olsrr in order to gauge influence. — Insect_biologist, Jan 16 '23 at 20:13
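
A side note on the code quoted in the comment above: for an unweighted `lm` fit, `residuals(..., type = "pearson")` returns the raw residuals, not standardized ones; `rstandard()` is the base R function that standardizes them. A minimal check, using the built-in ToothGrowth data as a stand-in for the egg-duration data:

```r
# Stand-in data only: ToothGrowth ships with R and is NOT the egg-duration data.
tg <- transform(ToothGrowth, dose = factor(dose))
fit <- lm(len ~ supp * dose, data = tg)

raw <- residuals(fit)
pearson <- residuals(fit, type = "pearson")
standardized <- rstandard(fit)

# For an unweighted lm, "pearson" residuals are just the raw residuals.
all.equal(raw, pearson)

# rstandard() divides each residual by its estimated standard deviation.
by_hand <- raw / (sigma(fit) * sqrt(1 - hatvalues(fit)))
all.equal(standardized, by_hand)
```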
@JohnMadden Mantises are very interesting insects indeed and very cool to look at! — Insect_biologist, Jan 16 '23 at 20:14
@whuber In that case I am fairly certain that this assumption is not relevant to my analyses, as the data were not collected in an ordered way. I did try running a Durbin-Watson test, though, and I am not sure what the following output means: lag = 1, autocorrelation = 0.893277, D-W statistic = 0.2054992, p-value = 0; alternative hypothesis: rho != 0 — Insect_biologist, Jan 16 '23 at 20:19
@Insect_biologist 1) gotcha 2) yes 3/5) indeed if your data are not ordered the results of time-series tests are not meaningful. 4) sounds good — John Madden, Jan 16 '23 at 20:55
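
To illustrate why row order matters for the Durbin-Watson statistic, here is a small base R sketch. It is an illustration under assumptions: ToothGrowth stands in for the egg-duration data, and `dw` is a helper written for this example using the textbook formula (the sum of squared successive differences of the residuals divided by their sum of squares), not a function from any package. Whether the rows of the original data set were stored in a sorted order is not known from the thread.

```r
# Stand-in data only: ToothGrowth ships with R and is NOT the egg-duration data.
tg <- transform(ToothGrowth, dose = factor(dose))
fit <- lm(len ~ supp * dose, data = tg)
e <- resid(fit)

# Durbin-Watson statistic: values near 2 suggest no lag-1 correlation,
# values near 0 suggest strong positive lag-1 correlation.
dw <- function(e) sum(diff(e)^2) / sum(e^2)

set.seed(1)                    # makes the shuffle reproducible
dw_shuffled <- dw(sample(e))   # same residuals in a random order
dw_sorted <- dw(sort(e))       # same residuals sorted by size

c(shuffled = dw_shuffled, sorted = dw_sorted)
```

The residuals are identical in both calls; only their order differs. The statistic is roughly 2(1 - r), where r is the lag-1 autocorrelation, which is consistent with the reported pair of 0.893 and 0.205. A value that low therefore describes the order in which the rows happen to be stored, and it says nothing useful when that order has no meaning.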

