
I previously posted a question about this dataset, but I have since run into some issues with the models, so I want to rephrase my question.

My dataset contains 50 morphometric characters measured on roe deer skulls, which we reduced with factor analysis or PCA to a few common components. I want a model that predicts skull dimensions from the absolute areas of forest and plowland in the habitats the deer live in. My first idea was to enter the absolute areas as predictors, but I was advised to use proportions instead and to add population abundance as a predictor in the model.

I know that a GLM with a binomial error structure is well suited to proportion data, but I am unsure how to specify such a model in R. I am also unsure how to enter the proportions: I can calculate percentages, but do they have to total 1 (100%)?

Any ideas?

SAMPLE DATASET

   Factor1 population manage foraging height biome abundance   area forest plough
-0.6033788 ADA_BEC    best   fields   low    agS        1500  73154  61154  12000
 0.3250981 ADA_BEC    best   fields   low    agS        1500  73154  61154  12000
 0.5577059 ADA_BEC    best   fields   low    agS        1500  73154  61154  12000
-0.1596194 PM         good   plains   med    kgS       23980 856251  89499 579870
-1.3089952 PM         good   plains   med    kgS       23980 856251  89499 579870
-2.1693392 SP         poor   mount    high   hgS        2500  65872  38000  47098
-0.9669080 SP         poor   mount    high   hgS        2500  65872  38000  47098
-1.8857842 SP         poor   mount    high   hgS        2500  65872  38000  47098
 0.7242678 DKN        best   fields   plain  agS       65908 989981 181133  12400
 1.6815373 DKN        best   fields   plain  agS       65908 989981 181133  12400

Area is the total area, and forest and plough are the forest and plowland areas (which don't add up to the total area because there are also meadowlands and urban areas; we have those data too). All areas are expressed in hectares (ha). Factor1 holds the factor-axis scores, and abundance is the total number of individuals, although we also have density (individuals per unit area).

Fedja Blagojevic

1 Answer


It looks to me like your responses are skull-related measurements and not counts, in which case there's no need for a binomial GLM. Further, if your response were a proportion, such as the fraction of an area, you wouldn't be dealing with a binomial (a count) but with compositional data.

If you have either binomial counts or compositional proportions as predictors (independent variables), they (or some suitable transform of them, were that necessary) can just enter a model as predictors, as long as you're able to condition on their values (rather than worry about errors-in-variables). No special "model" is necessary to incorporate them - you need to worry about how to model your response.

In your particular case, you might have a model that had each of the two areas or you might have a model that had one of the two proportions as well as the overall area.

If the two percentages sum to 100%, you wouldn't have both as predictors, because they'd be perfectly collinear (given an intercept in the model).
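
As a rough sketch of those two options, using only the ten sample rows from the question and assuming an ordinary linear model via lm() is adequate for the Factor1 scores (the names d, p_forest, fit_areas and fit_prop are made up for illustration, and the sketch ignores that skulls from the same population share the same predictor values):

```r
# Sample rows from the question (one row per skull; areas in ha)
reps <- c(3, 2, 3, 2)  # number of skulls per population in the sample
d <- data.frame(
  Factor1 = c(-0.6033788, 0.3250981, 0.5577059, -0.1596194, -1.3089952,
              -2.1693392, -0.9669080, -1.8857842, 0.7242678, 1.6815373),
  population = rep(c("ADA_BEC", "PM", "SP", "DKN"), times = reps),
  area   = rep(c(73154, 856251, 65872, 989981), times = reps),
  forest = rep(c(61154, 89499, 38000, 181133), times = reps),
  plough = rep(c(12000, 579870, 47098, 12400), times = reps)
)

# A proportion is simply the part divided by the total area
d$p_forest <- d$forest / d$area

# Option 1: both absolute areas as predictors
fit_areas <- lm(Factor1 ~ forest + plough, data = d)

# Option 2: one proportion plus the overall area
fit_prop <- lm(Factor1 ~ p_forest + area, data = d)

coef(fit_areas)
coef(fit_prop)
```

With only four populations in the sample there are just four distinct predictor combinations, so this shows how the predictors enter the formula rather than giving estimates worth interpreting.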


Side answer (a) - When the response is binomial:

There are several ways to input binomial data into a GLM in R:

You can input a two column response (counts of successes, counts of failures).

The responses can be entered as a factor, where the first level of the factor denotes failure and all other levels denote success; this is handy when you have data by individual.

It can be entered as a proportion of successes, with the number of trials in the prior weights. This might be the one you want but you should be able to make it work for any of the forms.
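
A small sketch of the three forms on made-up grouped data (the data frames g and ind and the fit names are hypothetical). Per R's ?family help page, with a factor response "success" means any level other than the first:

```r
# Made-up grouped data: succ successes out of n trials at each value of x
g <- data.frame(x = c(1, 2, 3, 4), succ = c(2, 4, 6, 9), n = c(10, 10, 10, 10))

# 1. Two-column response: cbind(successes, failures)
fit_cbind <- glm(cbind(succ, n - succ) ~ x, family = binomial, data = g)

# 2. Proportion of successes, with the number of trials as prior weights
fit_wts <- glm(succ / n ~ x, family = binomial, weights = n, data = g)

# 3. One row per individual, response as a factor
#    (first level "no" = failure, the other level "yes" = success)
ind <- data.frame(
  x = rep(g$x, times = g$n),
  y = factor(
    unlist(Map(function(s, n) rep(c("yes", "no"), times = c(s, n - s)),
               g$succ, g$n)),
    levels = c("no", "yes")
  )
)
fit_factor <- glm(y ~ x, family = binomial, data = ind)

# The coefficient estimates should agree up to numerical tolerance
all.equal(coef(fit_cbind), coef(fit_wts), tolerance = 1e-6)
all.equal(coef(fit_cbind), coef(fit_factor), tolerance = 1e-6)
```

The residual deviance of the individual-level fit differs from that of the grouped fits, but the estimated coefficients are the same model.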


Side answer (b) - When the response is a proportion:

With compositional data with two components, the usual approach is beta regression, though there are other approaches. As an example of something that does this, see the package betareg on CRAN; its vignettes would be a good place to start.

(With $k$ components, it's Dirichlet regression, the Dirichlet being a generalization of the beta; e.g. see DirichletReg on CRAN.)
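
For illustration only, here is a minimal beta-regression sketch with betareg on simulated data (the covariate x, the data frame sim and the parameter values are invented; it assumes the betareg package is installed and that the response lies strictly between 0 and 1):

```r
# install.packages("betareg")  # if not already installed
library(betareg)

set.seed(1)
n   <- 200
x   <- runif(n)                # hypothetical covariate
mu  <- plogis(-1 + 2 * x)      # mean proportion via the inverse logit
phi <- 20                      # precision parameter of the beta distribution
sim <- data.frame(x = x, y = rbeta(n, mu * phi, (1 - mu) * phi))

# Beta regression; the mean model uses a logit link by default
fit_beta <- betareg(y ~ x, data = sim)
summary(fit_beta)
```

The coefficients are on the logit scale of the mean proportion, and the fit also reports an estimate of the precision parameter.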

Reference related to the betareg package:

Cribari-Neto, F. and Zeileis, A. (2010)
Beta Regression in R
Journal of Statistical Software, Vol. 34, No. 2.

Reference to beta regression and another approach:

Kieschnick, R. and McCullough, B. D. (2003)
Regression analysis of variates observed on (0, 1): percentages, proportions and fractions
Statistical Modelling, Vol. 3, pp. 193–213.

http://www.pages.drexel.edu/~bdm25/statmodelling.pdf

Older reference to Dirichlet and transformed normal models, with some underlying technical details:

Aitchison, J. (1982)
The Statistical Analysis of Compositional Data
Journal of the Royal Statistical Society. Series B (Methodological), Vol. 44, No. 2., pp. 139-177.

http://leg.est.ufpr.br/lib/exe/fetch.php/pessoais:abtmartins:thestatisticalanalysisofcompositionaldata.pdf

I have seen still other approaches.

Glen_b
  • I have pasted a sample of the dataset, which in reality is much larger. I hope it is clear and that code can be worked out. – Fedja Blagojevic Apr 12 '13 at 10:43
  • How do I translate areas into proportions and enter them in a GLM? – Fedja Blagojevic Apr 12 '13 at 10:56
  • "How to translate areas into proportions?" ... you don't. Your response isn't area, surely? If it is, that isn't a count... and so it's not binomial. – Glen_b Apr 12 '13 at 11:16
  • Rereading your question, your responses are 'skull dimensions'. That's not binomial. (Binomials are counts). It doesn't matter if your predictors are anything in particular. – Glen_b Apr 12 '13 at 11:50
  • Yes, I should rephrase my question. If, say, one area consists of 36% forest and 64% plowland, how could that be entered in a correct statistical model? – Fedja Blagojevic Apr 12 '13 at 12:43
  • I have edited some discussion into my answer. You might have a model that had the two areas or you might have a model that had one of the two proportions as well as the overall area. With 'skull dimensions' I'd have been tempted to work on a log scale, but once converted to a factor, you're pretty much stuck with the factor I think. Am I correct in thinking Factor1 is the intended response? – Glen_b Apr 12 '13 at 22:58
  • Yes Factor 1 individual scores are the intended response, but also other factors in separate analyses. Since we have 50 variables we need this data reduction step so we can interpret the observed variability along the axes of variation and enter them in the models. – Fedja Blagojevic Apr 14 '13 at 10:43
  • I've only just noticed that the proportions don't necessarily add to 100%. If they're usually not close to 100%, it would be reasonable to include total area and both proportions, OR to consider forest, plough, and "other" which is the same as "area - forest - plough" ... but you need to make sure they're not too dependent or you'll need to drop one. – Glen_b Apr 15 '13 at 01:15
  • Thanks a lot. I have tried the DirichletReg package, reversing the roles: the proportions are the response variable (I can add the remaining area, meadowland, so that they add up to the total area) and the PC1 scores are the predictors. I hope this is OK. – Fedja Blagojevic Apr 15 '13 at 16:05
  • What would make it okay is if something to do with that model corresponds to your questions of interest. I got the impression that you were interested in differences in skull dimensions in terms of areas/proportions as predictors, which would suggest this model doesn't really answer your specific question. However, it's possible I haven't understood your circumstance correctly. The thing to start with is to be explicit about the questions you wish to answer - appropriate analyses will come as a consequence of that. – Glen_b Apr 15 '13 at 22:29