I'm having trouble finding a reliable p-value for my beta regression in R. I thought it would be as easy as reading it from the summary table, but as it turns out it is not. This website https://rcompanion.org/handbook/J_02.html tells me that the appropriate test to find the p-value for a betareg model is lrtest from the lmtest package. According to other forums and websites, p_value (parameters package), joint_tests (emmeans package), and nagelkerke (rcompanion package) could also be used.
Let me first introduce my model. I am investigating the influence of the percentage of overhead cover on the proportion of birds foraging. For this question I have made up three data sets: 1) realistic data, 2) random data, and 3) extreme data. I made these data sets to sanity-check the p-values (I assume that, if any, only the realistic data should give a significant p-value).
My R script and its output look like this.
library(betareg)    # betareg()
library(lmtest)     # lrtest()
library(rcompanion) # nagelkerke()
library(emmeans)    # joint_tests()
df <- data.frame(ProportionBirdsScavenging = c(0.666666666666667, 0.40343347639485, 0.7, 0, 0, 0.0454545454545455, 0.25, 0.707070707070707, 0.629213483146067, 0.882352941176471, 0.942857142857143, 0.451612903225806, 0.0350877192982456, 0.5, 0.484375, 0, 0.0208333333333333, 0.240740740740741, 0.804568527918782, 0.666666666666667, 1, 1, 1, 0.552238805970149, 0.2, 0, 0, 0, 0, 0, 0.12972972972973, 0.0894117647058824, 0.576158940397351, 0, 0),
pointWeight = c(3,233,10,89,4,22,44,99,89,17,35,341,57,36,128,39,144,54,394,12,46,229,55,67,5,28,2,160,124,294,555,425,302,116,48),
OverheadCover = c(0.7, 0.671, 0.6795, 0.79, 0.62, 0.62, 0.6413, 0.089, 0.4603, 0.04, 0.0418, 0.46, 0.5995, 0.532, 0.65, 0.6545, 0.74, 0.74, 0.02, 0.02, 0, 0, 0, 0.45, 0.8975, 0.92, 0.898, 0.89, 0.86, 0.69, 0.755, 0.775, 0.585, 0.585, 0.55),
Random_OverheadCover = c(0.3021, 0.6397, 0.7437, 0.9408, 0.9532, 0.4218, 0.8518, 0.2117, 0.5945, 0.531, 0.7656, 0.8952, 0.3127, 0.5513, 0.5619, 0.6332, 0.0051, 0.0388, 0.1877, 0.964, 0.2192, 0.4307, 0.5684, 0.062, 0.071, 0.0733, 0.6637, 0.5746, 0.4933, 0.4182, 0.8617, 0.7269, 0.7009, 0.655, 0.7696),
Extreme_OverheadCover = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1))
# REALISTIC DATA
df_realistic <- df
df_realistic$ProportionBirdsScavenging <- (((df_realistic$ProportionBirdsScavenging*(length(df_realistic$ProportionBirdsScavenging)-1))+0.5)/length(df_realistic$ProportionBirdsScavenging)) # Transform the data so all data is (0,1).
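For context, the squeeze applied above is the transformation recommended by Smithson & Verkuilen (2006) for responses that contain exact 0s and 1s, which betareg cannot handle. A minimal sketch as a reusable function (the name squeeze01 is my own, not from any package):

```r
# Smithson & Verkuilen (2006): squeeze y from [0, 1] into the open interval (0, 1).
# y: numeric vector of proportions; n: sample size (defaults to length(y)).
squeeze01 <- function(y, n = length(y)) {
  (y * (n - 1) + 0.5) / n
}

squeeze01(0, 35)  # 0.01428571 -- exact zeros are pulled just above 0
squeeze01(1, 35)  # 0.98571429 -- exact ones are pulled just below 1
```

This is equivalent to the inline expression used for df_realistic above.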
mybetareg_realistic <- betareg(ProportionBirdsScavenging ~ OverheadCover, data = df_realistic, weights = pointWeight, link = "logit")
summary(mybetareg_realistic)
# Call:
# betareg(formula = ProportionBirdsScavenging ~ OverheadCover, data = df_realistic, weights = pointWeight,
# link = "logit")
#
# Standardized weighted residuals 2:
# Min 1Q Median 3Q Max
# -27.4104 -6.9053 1.5775 5.3971 27.4439
#
# Coefficients (mean model with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 1.69853 0.02563 66.26 <2e-16 ***
# OverheadCover -4.28684 0.04365 -98.22 <2e-16 ***
#
# Phi coefficients (precision model with identity link):
# Estimate Std. Error z value Pr(>|z|)
# (phi) 8.7909 0.1861 47.23 <2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Type of estimator: ML (maximum likelihood)
# Log-likelihood: 3294 on 3 Df
# Pseudo R-squared: 0.6607
# Number of iterations: 12 (BFGS) + 1 (Fisher scoring)
parameters::p_value(mybetareg_realistic)
# Parameter p
# 1 (Intercept) 0
# 2 OverheadCover 0
# 3 (phi) 0
nagelkerke(mybetareg_realistic)
# $Models
#
# Model: "betareg, ProportionBirdsScavenging ~ OverheadCover, df_realistic, pointWeight, logit"
# Null: "betareg, ProportionBirdsScavenging ~ 1, df_realistic, pointWeight, logit"
#
# $Pseudo.R.squared.for.model.vs.null
# Pseudo.R.squared
# McFadden -2.69615e+00
# Cox and Snell (ML) 1.00000e+00
# Nagelkerke (Cragg and Uhler) -1.53209e-24
#
# $Likelihood.ratio.test
# Df.diff LogLik.diff Chisq p.value
# -1 -2587.3 5174.6 0
#
# $Number.of.observations
#
# Model: 35
# Null: 35
#
# $Messages
# [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
#
# $Warnings
# [1] "None"
#
# Warning message:
# In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
# no valid starting value for precision parameter found, using 1 instead
lrtest(mybetareg_realistic)
# Likelihood ratio test
#
# Model 1: ProportionBirdsScavenging ~ OverheadCover
# Model 2: ProportionBirdsScavenging ~ 1
# #Df LogLik Df Chisq Pr(>Chisq)
# 1 3 3546.9
# 2 2 959.6 -1 5174.6 < 2.2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Warning message:
# In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
# no valid starting value for precision parameter found, using 1 instead
joint_tests(mybetareg_realistic)
# The output is just 3 blank lines.
# RANDOM DATA
df_random <- df
df_random$ProportionBirdsScavenging <- (((df_random$ProportionBirdsScavenging*(length(df_random$ProportionBirdsScavenging)-1))+0.5)/length(df_random$ProportionBirdsScavenging))
mybetareg_random <- betareg(ProportionBirdsScavenging ~ Random_OverheadCover, data = df_random, weights = pointWeight, link = "logit")
# Warning message:
# In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
# no valid starting value for precision parameter found, using 1 instead
summary(mybetareg_random)
# Call:
# betareg(formula = ProportionBirdsScavenging ~ Random_OverheadCover, data = df_random, weights = pointWeight,
# link = "logit")
#
# Standardized weighted residuals 2:
# Min 1Q Median 3Q Max
# -24.6579 -7.4615 0.8881 5.0596 31.2095
#
# Coefficients (mean model with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.07481 0.04363 -1.715 0.0864 .
# Random_OverheadCover -0.73184 0.06930 -10.561 <2e-16 ***
#
# Phi coefficients (precision model with identity link):
# Estimate Std. Error z value Pr(>|z|)
# (phi) 1.37018 0.02466 55.57 <2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Type of estimator: ML (maximum likelihood)
# Log-likelihood: 1015 on 3 Df
# Pseudo R-squared: 0.01213
# Number of iterations: 8 (BFGS) + 2 (Fisher scoring)
parameters::p_value(mybetareg_random)
# Parameter p
# 1 (Intercept) 8.638446e-02
# 2 Random_OverheadCover 4.509897e-26
# 3 (phi) 0.000000e+00
nagelkerke(mybetareg_random)
# $Models
#
# Model: "betareg, ProportionBirdsScavenging ~ Random_OverheadCover, df_random, pointWeight, logit"
# Null: "betareg, ProportionBirdsScavenging ~ 1, df_random, pointWeight, logit"
#
# $Pseudo.R.squared.for.model.vs.null
# Pseudo.R.squared
# McFadden -5.73902e-02
# Cox and Snell (ML) 9.57020e-01
# Nagelkerke (Cragg and Uhler) -1.46624e-24
#
# $Likelihood.ratio.test
# Df.diff LogLik.diff Chisq p.value
# -1 -55.073 110.15 9.1055e-26
#
# $Number.of.observations
#
# Model: 35
# Null: 35
#
# $Messages
# [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
#
# $Warnings
# [1] "None"
#
# Warning message:
# In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
# no valid starting value for precision parameter found, using 1 instead
lrtest(mybetareg_random)
# Likelihood ratio test
#
# Model 1: ProportionBirdsScavenging ~ Random_OverheadCover
# Model 2: ProportionBirdsScavenging ~ 1
# #Df LogLik Df Chisq Pr(>Chisq)
# 1 3 1014.69
# 2 2 959.62 -1 110.15 < 2.2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Warning message:
# In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
# no valid starting value for precision parameter found, using 1 instead
joint_tests(mybetareg_random)
# Again, just 3 blank lines as output.
# EXTREME DATA
df_extreme <- df
df_extreme$ProportionBirdsScavenging <- (((df_extreme$ProportionBirdsScavenging*(length(df_extreme$ProportionBirdsScavenging)-1))+0.5)/length(df_extreme$ProportionBirdsScavenging))
mybetareg_extreme <- betareg(ProportionBirdsScavenging ~ Extreme_OverheadCover, data = df_extreme, weights = pointWeight, link = "logit")
# Warning message:
# In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
# no valid starting value for precision parameter found, using 1 instead
summary(mybetareg_extreme)
# Call:
# betareg(formula = ProportionBirdsScavenging ~ Extreme_OverheadCover, data = df_extreme, weights = pointWeight,
# link = "logit")
#
# Standardized weighted residuals 2:
# Min 1Q Median 3Q Max
# -21.1256 -8.0601 0.2244 5.1278 31.6603
#
# Coefficients (mean model with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.43280 0.02689 -16.09 < 2e-16 ***
# Extreme_OverheadCover -0.11404 0.03667 -3.11 0.00187 **
#
# Phi coefficients (precision model with identity link):
# Estimate Std. Error z value Pr(>|z|)
# (phi) 1.33763 0.02393 55.91 <2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Type of estimator: ML (maximum likelihood)
# Log-likelihood: 964.5 on 3 Df
# Pseudo R-squared: 0.01569
# Number of iterations: 8 (BFGS) + 2 (Fisher scoring)
parameters::p_value(mybetareg_extreme)
# Parameter p
# 1 (Intercept) 2.868274e-58
# 2 Extreme_OverheadCover 1.871710e-03
# 3 (phi) 0.000000e+00
nagelkerke(mybetareg_extreme)
# $Models
#
# Model: "betareg, ProportionBirdsScavenging ~ Extreme_OverheadCover, df_extreme, pointWeight, logit"
# Null: "betareg, ProportionBirdsScavenging ~ 1, df_extreme, pointWeight, logit"
#
# $Pseudo.R.squared.for.model.vs.null
# Pseudo.R.squared
# McFadden -5.03454e-03
# Cox and Snell (ML) 2.41241e-01
# Nagelkerke (Cragg and Uhler) -3.69602e-25
#
# $Likelihood.ratio.test
# Df.diff LogLik.diff Chisq p.value
# -1 -4.8312 9.6625 0.0018807
#
# $Number.of.observations
#
# Model: 35
# Null: 35
#
# $Messages
# [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
#
# $Warnings
# [1] "None"
#
# Warning message:
# In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
# no valid starting value for precision parameter found, using 1 instead
lrtest(mybetareg_extreme)
# Likelihood ratio test
#
# Model 1: ProportionBirdsScavenging ~ Extreme_OverheadCover
# Model 2: ProportionBirdsScavenging ~ 1
# #Df LogLik Df Chisq Pr(>Chisq)
# 1 3 964.45
# 2 2 959.62 -1 9.6625 0.001881 **
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Warning message:
# In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
# no valid starting value for precision parameter found, using 1 instead
joint_tests(mybetareg_extreme)
# model term df1 df2 F.ratio p.value
# Extreme_OverheadCover 1 Inf 9.668 0.0019
As far as I can tell from these outcomes, the p-values are as follows.
Realistic data
  summary      < 2e-16
  p_value      0
  nagelkerke   0
  lrtest       < 2.2e-16
  joint_tests  no output

Random data
  summary      < 2e-16
  p_value      4.509897e-26
  nagelkerke   9.1055e-26
  lrtest       < 2.2e-16
  joint_tests  no output

Extreme data
  summary      0.00187
  p_value      1.871710e-03
  nagelkerke   0.0018807
  lrtest       0.001881
  joint_tests  0.0019
According to these outcomes, all p-values are significant (< .05). I find this quite curious, given that I deliberately made up data sets that either do or do not show a relationship. Hence the following questions.
1) Which test is appropriate for finding the p-value for my model? Is the pseudo-R² from the summary table reliable, or should I take that from another test as well?
2) Should I look for a p-value at all, or is there a better statistic that tells me how good the model is? Which coefficients or statistics should I report when writing up this research?
3) What does the following warning mean, and why does it only sometimes pop up? For example, if I run my entire script first and then run the tests again without removing the objects from the environment, the warning does not appear.
Warning message:
In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
no valid starting value for precision parameter found, using 1 instead
4) Why does joint_tests only produce output for the extreme data?
Sorry for the many questions, but I am in over my head here. Can somebody shed some light on this and help me?

joint_tests reports results only for categorical factors, as far as I know. – Sal Mangiafico Dec 13 '19 at 01:45

There is a discussion of how betareg handles weights at this Cross Validated page. It may be that these case weights aren't what you want, but instead that you want to use proportional weights. So, using something like the following, you will get the same coefficients in the results, but different p-values: df_random$Weight = df_random$pointWeight / sum(df_random$pointWeight). – Sal Mangiafico Dec 13 '19 at 15:49
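Following up on that comment, a sketch of the proportional-weights idea. Rescaling all case weights by a constant leaves the maximum-likelihood point estimates unchanged but rescales the information matrix, so the standard errors and p-values change. (The Weight column name comes from the comment; the betareg refit is shown commented out since it assumes the df_random data frame from the question.)

```r
# Proportional weights: rescale the raw counts so they sum to 1.
# This leaves betareg's coefficient estimates unchanged but gives different
# (typically far less extreme) standard errors and p-values than raw counts.
pointWeight <- c(3, 233, 10, 89, 4)       # first few counts from the question
Weight <- pointWeight / sum(pointWeight)  # proportional weights
sum(Weight)  # 1

# Applied to the question's model, this would be:
# df_random$Weight <- df_random$pointWeight / sum(df_random$pointWeight)
# betareg(ProportionBirdsScavenging ~ Random_OverheadCover,
#         data = df_random, weights = Weight, link = "logit")
```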