I am trying to find out whether some anthropogenic variables (e.g., population density, population growth, and roads) explain animals' distributions. My dependent variable is the percentage of area occupied (continuous, range: 0-100; e.g., 87.9). Each species also has a status (binary, 0/1; the Establishment column in the data below) and an Order (more than one species can belong to the same Order; e.g., Rodentia), and I expect each independent variable to act differently depending on the species' status. For instance, I expect no influence of Roads for a species with status 0, but an influence of Roads for a species with status 1.
The values of the independent variables are measured throughout each species' distribution range. Some of these independent variables can take negative values (Population Growth). I first looked into GLMMs because I thought they were more flexible. However, I don't really understand what my grouping variable or random effect(s) could be. As I understand it, a grouping variable is something that groups the data such that the response variable is expected to behave differently from group to group. In my case, it could be the Country (as different countries may have different population densities or numbers of roads), but I don't have this information (the species' distributions encompass more than one country), and the independent variables are not measured at country level but are averaged over each species' distribution range.

I switched to a GLM (glm from the stats package; I don't know if there are better options), thinking it would be a better fit. I think I should use a Gamma distribution because the response variable is continuous and > 0, but I am wondering whether my choices are correct and how to deal with negative values in the independent variables. So far I've tried this:

glm(range_perc ~ PopdensityAvg + PopgrowthAvg + Railways + Roads,
    start = c(range(myData$range_perc)[1],
              range(myData$PopdensityAvg)[1],
              range(myData$PopgrowthAvg)[1],
              range(myData$Railways)[1],
              range(myData$Roads)[1]),
    data = myData,
    family = Gamma)
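
The expectation that Roads matters for status 1 but not for status 0 can be written as an interaction between Roads and Establishment. The following is only a minimal sketch, not a recommended final model: it uses just the ten sample rows from this post (stored in a data frame named `sample_rows`, a name chosen here for illustration), treats Establishment as a factor, and assumes a log link so that fitted means stay positive. Negative values in a predictor such as PopgrowthAvg are not a problem for the Gamma family, because the positivity requirement applies to the response only. Note also that `start` in `glm()` is meant for starting values of the coefficients, so it is left out here.

```r
# The ten sample rows shown in this post (only the columns used here).
sample_rows <- data.frame(
  Establishment = factor(c(1, 0, 0, 1, 0, 0, 1, 0, 1, 0)),
  range_perc    = c(20.04902, 100, 100, 97.82241, 100, 100,
                    24.25978, 100, 77.20798, 100),
  PopgrowthAvg  = c(0.2019391, 1.2507490, 0.2815404, 10.8665762, 11.4654551,
                    1.4710270, 13.3291273, 5.9212909, -0.3536499, -0.1254798),
  Roads         = c(2.983818, 2.728217, 4.964841, 2.964914, 3.000512,
                    6.450338, 2.345860, 3.228113, 2.430885, 2.327852)
)

# Roads * Establishment expands to Roads + Establishment + Roads:Establishment,
# so the Roads slope is allowed to differ between status 0 and status 1.
# Assumption: log link (not the Gamma default) to keep fitted means positive.
# No `start` vector is supplied, so glm() chooses its own starting values.
fit <- glm(range_perc ~ PopgrowthAvg + Roads * Establishment,
           data = sample_rows, family = Gamma(link = "log"))

# The interaction coefficient is the difference in the Roads slope
# (on the log scale) between status 1 and status 0.
names(coef(fit))
```

With so few rows the estimates themselves mean little; the point is only the formula structure. The same `* Establishment` pattern could be applied to the other predictors, sample size permitting.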

This is a sample of my data:

                   Binomial           Order Establishment range_perc PopdensityAvg PopgrowthAvg  Railways    Roads
1       Apodemus_sylvaticus        Rodentia             1   20.04902      4.908014    0.2019391 0.0000000 2.983818
2       Apodemus_sylvaticus        Rodentia             0  100.00000     36.353451    1.2507490 0.2885747 2.728217
3                 Axis_axis Cetartiodactyla             0  100.00000     61.042892    0.2815404 0.0000000 4.964841
4   Callosciurus_erythraeus        Rodentia             1   97.82241    329.174194   10.8665762 1.5212460 2.964914
5   Callosciurus_erythraeus        Rodentia             0  100.00000    338.821411   11.4654551 1.5289692 3.000512
6  Callosciurus_finlaysonii        Rodentia             0  100.00000    155.620636    1.4710270 2.0869565 6.450338
7            Capra_aegagrus Cetartiodactyla             1   24.25978    142.892624   13.3291273 0.5820163 2.345860
8            Capra_aegagrus Cetartiodactyla             0  100.00000     78.786888    5.9212909 0.1487832 3.228113
9                Capra_ibex Cetartiodactyla             1   77.20798     27.929804   -0.3536499 0.2243542 2.430885
10               Capra_ibex Cetartiodactyla             0  100.00000     26.661007   -0.1254798 0.3145299 2.327852

I am using R 4.0.3.

LT17
  • Do you have one data point per species? Since the outcome variable is a proportion perhaps you were thinking about beta regression? – dipetkov Aug 10 '22 at 18:47
  • If you mean points with coordinates for data points, no, I don't have data points. The proportion is the proportion of the area occupied. I guess it can be transformed to range from 0 to 1 for the beta regression, but if there are assumptions to satisfy (e.g., normality), I don't think these data will fit... – LT17 Aug 10 '22 at 18:58
  • By data points I mean rows in the data table. How many rows/observations do you have for each species? I ask because you say "independent variables (...) are averaged over each species' distribution range". – dipetkov Aug 10 '22 at 19:02
  • @LisaTedeschi I second the suggestion to use beta regression. The betareg package has a really nice vignette. I suggest you give it a read. A beta-distribution is way more appropriate for percentage data than a Gamma-distribution. – Roland Aug 11 '22 at 06:41
  • @dipetkov for each species I may have one (e.g. species in row 6) or two (the others). If there are two points, they are different (one has establishment status 0, the other has 1), and the distribution range is different as well, thus also the proportion is different. – LT17 Aug 11 '22 at 08:26
  • Thanks @Roland, I am reading the descriptions now. Why would you recommend a beta regression over a GLM with Gamma distribution? I mean, in which way is it more accurate? – LT17 Aug 11 '22 at 08:28
  • Percentage data is bound by zero and one. The beta distribution is non-zero in the interval (0,1). The Gamma distribution is non-zero for positive numbers. It should be obvious why the beta distribution is better suited. (You might have some issues if you have exact zeros and ones in your data.) – Roland Aug 11 '22 at 08:31
  • @Roland I actually have many 0 and 1 values in my data (as you can see from the sample as well, where 6 out of 10 myData$range_perc values are 100). From an answer here https://stats.stackexchange.com/questions/31300/dealing-with-0-1-values-in-a-beta-regression , "if y also assumes the extremes 0 and 1, a useful transformation in practice is (y * (n−1) + 0.5) / n where n is the sample size", but this transforms my 100% myData$range_perc into 95%... – LT17 Aug 11 '22 at 08:50
  • Then you might be better suited with logistic regression. Do you have the total areas? – Roland Aug 11 '22 at 09:23
  • @Roland yes, I have it, I just removed it from the sample data after calculating myData$range_perc. But isn't logistic regression for 0 and 1 response variables (success/failure, Y/N, male/female...)? myData$Establishment isn't my response variable. I could transform myData$range_perc into a proportion (ranging from 0 to 1 inclusive) and fit a GLM with a binomial distribution, but even though it is a proportion and ranges from 0 to 1, it is not a proportion of successful cases, so I guess that is not appropriate... – LT17 Aug 11 '22 at 09:43
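
The transformation quoted in the comments, (y * (n−1) + 0.5) / n, can be checked numerically. The sketch below is only an illustration: `squeeze_unit` is a hypothetical helper name, and the input is the range_perc column of the ten sample rows rescaled from percent to a proportion. It shows that the 95% figure corresponds to n = 10, and that the amount of shrinkage away from 0 and 1 depends on the sample size passed as n.

```r
# Squeeze proportions from [0, 1] into (0, 1): (y * (n - 1) + 0.5) / n,
# where n is the sample size. `squeeze_unit` is a hypothetical helper name.
squeeze_unit <- function(y, n = length(y)) {
  stopifnot(all(y >= 0 & y <= 1))
  (y * (n - 1) + 0.5) / n
}

# range_perc from the ten sample rows, rescaled from percent to proportion.
prop <- c(20.04902, 100, 100, 97.82241, 100, 100,
          24.25978, 100, 77.20798, 100) / 100

squeezed <- squeeze_unit(prop)      # n defaults to length(prop) = 10
range(squeezed)                     # all values now strictly inside (0, 1)

squeeze_unit(1, n = 10)             # 0.95, the "95%" mentioned above
squeeze_unit(1, n = 1000)           # 0.9995, much closer to 1
```

So with the full data set (rather than the ten-row sample) as n, a value of 100% would be moved far less than five percentage points.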