
I am getting very different results for a negative binomial model between R and SAS. Can you please suggest why this is happening? I am not able to attach the CSV file used as the data.

R script:

library(MASS)  # glm.nb is in MASS

NB_Total <- glm.nb(Total_Crashes ~ Ln_AADT + offset(Ln_LaneMiles),
                   weights = LANEMILES, data = SPF_data_test)
summary(NB_Total)

R model estimates:

             Estimate Std. Error z value Pr(>|z|) 
(Intercept) -5.962671 0.018471   -322.8  <2e-16 *** 
Ln_AADT      0.738122 0.002503    294.9  <2e-16 ***

SAS script:

proc genmod data=V1;
  weight lanemiles;
  model Total_Crashes = ln_aadt / dist=negbin link=log type3 offset=Ln_LaneMiles;
run;

SAS model estimates:

Parameter Estimate Standard Error Pr > ChiSq 
Intercept -10.3805 0.1657         <.0001 
Ln_AADT    0.7443  0.0212         <.0001 

Update: I checked with and without the weight factor in both R and SAS. Without the weight factor (i.e. with only Ln_AADT as the covariate and Ln_LaneMiles as the offset), the results are identical: the parameter coefficients and standard errors are the same in both, and the inverse of the dispersion parameter in R equals the dispersion in SAS. The difference appears only when I add the weight factor. I have not yet found any documentation that explains how R includes the weight in the estimation.
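For what it's worth, the usual way "prior" weights enter a negative binomial (NB2) fit is by multiplying each observation's log-likelihood contribution by its weight. The sketch below (pure Python, illustrative only) shows that weighting scheme; it is an assumption based on the standard GLM prior-weight convention, not a reproduction of what either glm.nb or PROC GENMOD actually implements internally.

```python
from math import lgamma, log

def nb2_loglik(y, mu, theta, w=None):
    """Weighted NB2 log-likelihood.

    Each observation's log-likelihood term is multiplied by its prior
    weight w[i] (the assumed weighting scheme). theta is R's dispersion
    parameterization: Var(Y) = mu + mu^2 / theta.
    """
    if w is None:
        w = [1.0] * len(y)
    total = 0.0
    for yi, mi, wi in zip(y, mu, w):
        ll = (lgamma(theta + yi) - lgamma(theta) - lgamma(yi + 1)
              + theta * log(theta / (theta + mi))
              + yi * log(mi / (theta + mi)))
        total += wi * ll
    return total

y = [2, 0, 3, 1]
mu = [1.5, 0.8, 2.5, 1.2]

# Under this convention, doubling every weight doubles the total
# log-likelihood; with constant weights the coefficient estimates are
# unchanged, but the estimated theta can shift.
unweighted = nb2_loglik(y, mu, theta=1.0)
weighted = nb2_loglik(y, mu, theta=1.0, w=[2, 2, 2, 2])
```

If R and SAS agree without weights but disagree with them, comparing each package's behavior against a hand-rolled weighted likelihood like this one is a way to pin down which convention each is using.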

Meg
  • Not all of us are familiar with both the programming languages in question. What does "log type 3" mean in the SAS link statement? In one model I see `model &acc=ln_aadt...` and in the other I see `Total_Crashes~Ln_AADT`; is `acc` the same as `Total_Crashes`? Also, those results don't look very different to me, especially w/o standard errors to provide a context of what "very different" might be from an estimation point of view. If you could expand on what the models are, in mathematical formulation, that might help too. – jbowman Jun 07 '22 at 03:01
    There are two parameterizations for a negative binomial distribution. Some count the trial on which the $r$th success is observed and some count the number of failures before the $r$th success is observed. If SAS uses one parameterization and R uses another, that might account for whatever difference you see. (R uses the latter.) – BruceET Jun 07 '22 at 03:23
  • Hi @jbowman, yes, you're right. I changed the variable name from "acc" to "Total_Crashes". They are the same variable. Yes, I am trying to understand how the estimation differs mathematically. Although the estimates do not look drastically different, and neither do the standard errors, the total observed versus total predicted (in other words, the sum of residuals over all observations) is quite different between the two. – Meg Jun 07 '22 at 04:19
  • Above is the SAS output, below is the R output:

        Parameter  Estimate   Standard Error  Pr > ChiSq
        Intercept  -10.3805   0.1657          <.0001
        Ln_AADT      0.7443   0.0212          <.0001

        Coefficients:
                     Estimate   Std. Error  z value  Pr(>|z|)
        (Intercept)  -5.962671  0.018471    -322.8   <2e-16 ***
        Ln_AADT       0.738122  0.002503     294.9   <2e-16 ***

    – Meg Jun 07 '22 at 04:22
  • Thank you so much, Dr. Bruce. I do not come from a statistics background, and I am not sure I followed your explanation correctly. Can you please suggest some documentation or a reference that would help me understand the difference in parameterization? Thank you very much. – Meg Jun 07 '22 at 04:28
    A reproducible dataset would be helpful. Things to check: initial conditions, and glm control functions between R and SAS. – AdamO Jun 07 '22 at 04:36
  • I am not able to attach any csv file to this question; I wish I could. In any case, I am not providing any init.theta value. Also, I can see the control function in R for the glm.nb model as glm.control(epsilon = 1e-8, maxit = 25, trace = FALSE), but I do not know how to check that in SAS. – Meg Jun 07 '22 at 04:49
    Apparently proc genmod and proc glm use different model parameterizations and generate different estimates as a result; you might try proc glm and see if its results are the same as R or not. – jbowman Jun 07 '22 at 05:21
    Another reason for discrepancies that has cropped up in similar questions is that errors in reading and processing the data might have occurred. What have you done to confirm both programs are working with exactly the same data? – whuber Jun 07 '22 at 16:17
    Hi, I got the summary of the data and compared that on both platforms. – Meg Jun 08 '22 at 15:49
  • @jbowman, can you please suggest any reference where I can find these differences in mathematical expression? I will try proc glm as well. – Meg Jun 08 '22 at 16:14
  • Unfortunately not; I found a post from someone asking why their model generated different results using the two procs, and the response (from SAS, if I recall correctly) was that the underlying parameterization was different, so that's all I know. – jbowman Jun 08 '22 at 16:19
  • It appears you're using 'lane miles' as a weight and its logarithm as an offset - why? Anyway: (1) Check the data. (2) Make sure you understand what models you're fitting - read the manuals (R's manual is scanty, but I summarize the essentials from MASS here). Two things that can vary between negative binomial models are the link & the mean-variance relation; so pay attention to those, though R uses an NB2 model with log link by default & I doubt SAS does otherwise. The parameterization won't make any difference ... – Scortchi - Reinstate Monica Jun 10 '22 at 20:53
  • ... to predictions, but you ought to understand it - the 'dispersion' parameter R gives is the reciprocal of what's often used. Are weights & offsets defined equivalently? (3) Confirm that SAS & R use the same estimation method. For fixed 'dispersion' a negative binomial regression is a generalized linear model: R iterates GLMs to find the maximum-likelihood estimate of all parameters including dispersion, but I don't know what SAS does by default. (4) Check convergence & find out how to change the stopping criteria - it could just be numerical issues. – Scortchi - Reinstate Monica Jun 10 '22 at 20:53
  • NB standard errors will be different: see https://stats.stackexchange.com/q/221648/17230 – Scortchi - Reinstate Monica Jun 10 '22 at 21:55
  • Thank you, @Scortchi - Reinstate Monica! I updated my question as below. "I checked with and without the weight factor on both R and SAS. Without the weight factor (i.e. only Ln_AADT as the covariate and Ln_LaneMiles as the offset), the results are identical (i.e. the parameter coefficients and standard errors are the same on both, and the inverse of the dispersion parameter on R is equal to the dispersion on SAS). However, the difference in results is happening when I'm adding the weight factor. I could not find any documentation yet that explains how R includes the weight in the estimation." – Meg Jun 12 '22 at 03:30
  • In R's glm function they're prior weights - the scale (aka dispersion, unfortunately) parameter is divided by an observation weight at each occurrence in the likelihood formula (see MASS, Ch. 7, p. 183). Note the scale parameter will be fixed at 1 for negative binomial models. Weights are defined the same way in the documentation of SAS's GENMOD. Do you get a discrepancy using a different model - say a Poisson? Or without using an offset? Or on different data? Or fixing the dispersion parameter? – Scortchi - Reinstate Monica Jun 13 '22 at 23:43
  • I have tried Poisson in both R and SAS, and the results are the same. I have also tried without an offset in R and SAS, and the results are the same as well. The difference occurs only when I add the weight factor. Regarding fixing the dispersion parameter, I used init.theta (in glm.nb), fixing it at the inverse of the dispersion parameter obtained from SAS, but it did not accept it and computed its own (also, with that theta, I should mention that the log-likelihood and AIC values are better in the results from R). – Meg Jun 14 '22 at 13:22

0 Answers