
I am trying to analyse data on how long deer were vigilant during a 2-minute observation period, and how this varies between males and females and between the centre and edge of the herd.

My layout is sort of like this just to give an idea:

                 Male | Female
Centre of herd   14.5 | 32.4
Edge of herd     63.5 | 54.2

I have proportional data for the time spent vigilant (out of 120 seconds) and have been told to arcsine transform this and then perform a two way ANOVA.

However, I've been reading and hearing a lot of bad things about the arcsine transformation, and was wondering what else I could do instead and how to do it in R?

Please try and be clear and basic, I'm very new to stats and R!

Jessica
  • Is the example data the group averages, or are the numbers from just 4 animals? How much data do you have? Are your data a one-shot observation, or did you follow them for a while & get multiple timings under different conditions? – gung - Reinstate Monica Jun 20 '15 at 20:29
  • @gung the example data given here are the group averages, I have 102 observations (46 females and 56 males), and the data are one-shot observations of different deer (so 102 deer observed for 2 minutes each) – Jessica Jun 20 '15 at 20:55
  • Is it roughly 50-50 center vs edge? – gung - Reinstate Monica Jun 20 '15 at 21:05
  • @gung yes it is – Jessica Jun 20 '15 at 21:12
  • @gung is an arcsine transformation on the proportional data (vigilance) and then analysis using ANOVA appropriate for this data? – Jessica Jun 20 '15 at 21:27

1 Answer


Routine use of the arcsine transformation is rightly criticized. However, using data transformations to meet analysis assumptions is quite common and perfectly fine. The arcsine is just another transformation, so if it allows you to meet the required assumptions of your analysis, you can use it. Just don't apply it blindly to your data simply because they are proportions.

The ANOVA assumes the data are normally distributed within each cell of your design. However, ANOVAs are fairly robust, and you will have a reasonable amount of data per cell (>20), so you are probably OK if the data are not too far from normality. Strong skews in different directions, coupled with widely differing standard deviations, would be the most damaging. As a rule of thumb (take it for what it's worth), if the ratio of the largest cell SD to the smallest cell SD is $\le2$, and none of the skewnesses is $>1$, you might be fine. If your data are far from normal, and/or have very different SDs, you can try some transformations (including the arcsine) to see if that helps.
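Here is a minimal R sketch of that workflow, assuming a data frame called `deer` with columns `vigilance` (the proportion of the 120 s spent vigilant, between 0 and 1), `sex`, and `position` (centre vs. edge); those names are placeholders for whatever your data actually use:

```r
library(e1071)  # for skewness(); install.packages("e1071") if needed

# Check the rule of thumb: SD and skewness within each cell of the design
aggregate(vigilance ~ sex + position, data = deer,
          FUN = function(x) c(sd = sd(x), skew = skewness(x)))

# Arcsine (angular) transformation of the proportion
deer$vig_asin <- asin(sqrt(deer$vigilance))

# Two-way ANOVA with the sex-by-position interaction
fit <- aov(vig_asin ~ sex * position, data = deer)
summary(fit)
```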


The advice above is not truly ideal, but may well be good enough. I emphasize that approach because you say you are very new to this and have few resources to rely on. Note that logistic regression (the GLiM I suspect you have in mind) is not appropriate for your data. Logistic regression assumes the distribution of your response variable is binomial, that is, a discrete number of 'successes' out of a known number of 'trials'. You have a continuous proportion: it is not meaningful to consider each second of vigilance (or not) as a unique trial. (You have discretized the time by rounding to seconds, but that isn't really relevant.)

For more sophisticated options, you could look into beta regression or ordinal logistic regression. Both are rather advanced, though; if you are struggling with ANOVA and (binary) logistic regression, they are probably a bridge too far.

The beta distribution is the most common distribution for continuous proportions. Beta regression is a regression model that assumes the response follows a beta distribution. Technically, it is not a generalized linear model, although it is close enough that you can use many of the same ideas to understand it. The betareg package in R fits these models; I have a small example here: Remove effect of a factor on continuous proportion data using regression in R.
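A minimal sketch with betareg, again using the hypothetical `deer` data frame from above: beta regression requires the response to lie strictly between 0 and 1, so any exact 0s or 1s are first compressed with the common Smithson & Verkuilen adjustment.

```r
library(betareg)

# Compress the proportion away from 0 and 1 so the beta likelihood is defined
n <- nrow(deer)
deer$vig_adj <- (deer$vigilance * (n - 1) + 0.5) / n

# Beta regression with the same sex-by-position design
fit_beta <- betareg(vig_adj ~ sex * position, data = deer)
summary(fit_beta)
```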

Ordinal logistic regression is a non-parametric option. All you need to assume is that your response data are meaningfully ordered (i.e., that $14.5\% < 32.4\%$). Many people are flummoxed by the use of OLR with continuous data (it is fine), so you may be called on to defend this choice. That fact, in addition to its more advanced nature and the greater sophistication required to interpret it, may make it less practical for you. Note that OLR is a GLiM. There are several ways to fit OLRs in R; the excellent UCLA statistics help site has a tutorial here.
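If you did want to try it, one minimal sketch uses the rms package, whose orm() function copes well with a response that has many distinct values (as a continuous proportion will); the `deer` data frame is again hypothetical.

```r
library(rms)

# Ordinal logistic regression: only the ordering of the vigilance values is used
fit_olr <- orm(vigilance ~ sex * position, data = deer)
fit_olr
```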