Using Poisson GLM for visits to a historical monument - Am I using the right method?

Question

Dependent variable - number of visitors to a historical monument by day

Independent variables - Daily average temperature, relative humidity, number of tourists visiting the state by day, etc.

My task is to understand the key drivers that influence the number of visitors. So far, I have done the following:

1) Fitted a multiple regression in R, LN(Number of Visitors) ~ independent variables, using the lm() function (from base R's stats package). I also transformed some of the independent variables, as recommended by BoxCoxTrans() from the caret package. The resulting regression diagnostics look pretty decent to me. The R-squared was approximately 25%, which is satisfactory to me, given the data that I have.

2) I have also tried fitting a model with the glm.nb() function from the MASS package, because the dependent variable showed over-dispersion per a test for over-dispersion. The resulting regression diagnostics look pretty decent to me.

The residuals are pretty well-behaved in both cases, given that it's real-world data. However, the two models give vastly different coefficient estimates, e.g., a 1-degree increase in temperature corresponds to a 10% increase in the number of visitors per the regression model and 30% per the GLM with a Poisson or quasi-Poisson distribution.

I would like to cross-validate with the community to make sure that I am not using an inappropriate technique for the type of data I have, and to find out which of the techniques is better suited to the given data. Thank you!

Output from the lm() is as follows:

 Call:
 lm(formula = CT ~ Review + MinTemp + RH + Delta + xRate + PercIntl + PercOnline + CSI, data = transformed)

 Residuals:
     Min       1Q   Median       3Q      Max 
-2.31465 -0.57769 -0.03228  0.56113  2.96008  

 Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
 (Intercept)  -0.150803   0.009828 -15.344  < 2e-16 ***
 Review        0.103383   0.009788  10.562  < 2e-16 ***
 MinTemp      -0.275583   0.012636 -21.809  < 2e-16 ***
 RH            0.190549   0.011313  16.844  < 2e-16 ***
 DeltaMax      0.030461   0.010626   2.867  0.00416 ** 
 xRate         0.181127   0.013951  12.983  < 2e-16 ***
 PercIntl      0.318809   0.010610  30.049  < 2e-16 ***
 PercOnline   -0.212168   0.011827 -17.939  < 2e-16 ***
 CCI          -0.080672   0.011022  -7.319 2.79e-13 ***
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Residual standard error: 0.8085 on 6855 degrees of freedom
  (1847 observations deleted due to missingness)
 Multiple R-squared:  0.2495,   Adjusted R-squared:  0.2486 
 F-statistic: 284.8 on 8 and 6855 DF,  p-value: < 2.2e-16

Output from the glm() is as follows:

 Call:
glm(formula = CT ~ Reviews + Delta + RH + xRate + 
    PercOnline + PercIntl + CSI + Temp, family = quasipoisson(), 
    data = tp, subset = !selector)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-46.404  -12.484   -5.329    5.063  100.196  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.7583433  1.0223177  -3.676 0.000239 ***
Review      -0.0201375  0.0010352 -19.453  < 2e-16 ***
DeltaMax     0.0063672  0.0015213   4.185 2.89e-05 ***
RH           0.0019643  0.0006838   2.873 0.004083 ** 
xRate        0.1009589  0.0082975  12.167  < 2e-16 ***
PercOnline  -0.0233884  0.0012857 -18.192  < 2e-16 ***
PercIntl     0.0148912  0.0011250  13.236  < 2e-16 ***
CSI         -0.0068745  0.0009345  -7.356 2.13e-13 ***
Temp         0.2362620  0.0149763  20.428 1.39e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    (Dispersion parameter for quasipoisson family taken to be 312.9381)

        Null deviance: 2145768  on 6303  degrees of freedom
        Residual deviance: 1591885  on 6295  degrees of freedom
        (331 observations deleted due to missingness)

AIC: 67908

Number of Fisher Scoring iterations: 5

Analysis of Deviance Table (Type II tests)

Response: CT
                   LR Chisq       Df    Pr(>Chisq)
Review            372.02         1     < 2.2e-16 ***
Delta             17.47          1      2.912e-05 ***
RH                8.24           1      0.004103 ** 
xRate             150.20         1     < 2.2e-16 ***
PercOnline        337.02         1     < 2.2e-16 ***
PercIntl          163.27         1     < 2.2e-16 ***
CSI               53.68          1     2.362e-13 ***
Temp              42.02          1     9.052e-11 ***

Nice question. Is there a chance we could see the models' output? Just to clarify: by glm.nb() did you mean the GLM with a quasi-Poisson distribution? (I would expect glm(family='quasipoisson', ...).) Both account for over-dispersion; I am just uncertain which one you used and which one you are referring to. — usεr11852, Feb 05 '16 at 18:47
Thanks! I tried both - negative binomial and quasi-Poisson. In this particular instance, I was just trying different things. Because of over-dispersion, I think quasi-Poisson is more appropriate. — States.the.Obvious, Feb 05 '16 at 18:52
You could also add the lm output so we can compare... :) In addition: 1. Can you check the models' fit? 2. You mention that you BoxCoxed some variables; are these variables used in both models? 3. What kind of temperature scale is this? Have you centred/scaled any of the predictors? — usεr11852, Feb 06 '16 at 01:19
Sorry! My formatting sucks! But, to answer your questions- Yes, I checked the model fits - both the models seem to have reasonably good fit based on the qq plot. BoxCoxed variables are only in the lm model. I didn't transform the variables in the glm model. Come to think of it, maybe that's why the interpretation is so different. I will check on that. And, yes, the variables in the lm model were centered and scaled before applying BoxCox. Thank you! — States.the.Obvious, Feb 06 '16 at 02:27
I am glad I could help! I wrote up these findings in a quick summary in my answer below; I give you some general advice too. Good luck with the rest of the analysis! :D — usεr11852, Feb 06 '16 at 09:21

score 1 · Accepted Answer · edited Apr 13 '17 at 12:44

Taking into consideration your comments and the model outputs presented, I think that the basic issue you are having is due to different scaling of your explanatory variables in the two models. Your idea of using a negative-binomial or a quasi-Poisson model is insightful. Many people turn a blind eye to overdispersion; you have not, but unfortunately you stumbled on the scaling of your explanatory variables.
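On the overdispersion point, here is a minimal sketch of where a figure like the "Dispersion parameter" in a quasi-Poisson summary comes from. The data are simulated and purely hypothetical (the names visits and temp and all parameter values are my assumptions, not your monument data); the point is only that the quasi-Poisson fit keeps the Poisson point estimates and inflates the standard errors by the square root of the estimated dispersion.

```r
# Simulated stand-in data (hypothetical; not the monument data)
set.seed(1)
n      <- 500
temp   <- rnorm(n, mean = 25, sd = 5)
mu     <- exp(2 + 0.05 * temp)
visits <- rnbinom(n, mu = mu, size = 2)  # counts with variance well above the mean

# Plain Poisson fit, then the Pearson-based dispersion estimate:
# sum of squared Pearson residuals divided by the residual degrees of freedom
pois_fit   <- glm(visits ~ temp, family = poisson())
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)

# Quasi-Poisson fit: same coefficients, standard errors scaled by sqrt(dispersion)
qp_fit   <- glm(visits ~ temp, family = quasipoisson())
se_ratio <- summary(qp_fit)$coefficients[, "Std. Error"] /
            summary(pois_fit)$coefficients[, "Std. Error"]

print(dispersion)        # a value well above 1 signals overdispersion
print(se_ratio)          # both entries equal sqrt(dispersion)
```

A dispersion far above 1, as in your output, means plain Poisson standard errors would be much too small, which is why the quasi-Poisson or negative-binomial route is sensible.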

While in the (quasi-)Poisson GLM you used the untransformed explanatory variables, in the LM you seem to have used transformed (and, per your comment, centred and scaled) explanatory variables in some cases. This clearly muddles the interpretation of the models at hand: a coefficient on a transformed predictor is no longer an effect per original unit (e.g. per degree), so it cannot be compared directly with the corresponding GLM coefficient. This issue has been mentioned on CV a couple of times; for starters I would suggest looking at this (relatively long) thread: Transforming variables for multiple regression in R.
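To illustrate the scaling problem, here is a small sketch on simulated data (again hypothetical: the variable names, the true effect of 0.10 per degree and the negative-binomial noise are all assumptions of mine). It fits a log-linear LM on a centred and scaled predictor and a quasi-Poisson GLM on the raw predictor, then puts both on a per-degree footing. With a log response or a log link, a coefficient b translates to a 100 * (exp(b) - 1) percent change per one-unit increase in the predictor as it was entered into the model.

```r
# Simulated stand-in data (hypothetical; not the monument data)
set.seed(42)
n      <- 2000
temp   <- runif(n, 10, 35)                 # temperature in degrees
mu     <- exp(5 + 0.10 * temp)             # assumed true effect: about 10.5% per degree
visits <- rnbinom(n, mu = mu, size = 5)
stopifnot(all(visits > 0))                 # log(visits) needs strictly positive counts

temp_z <- as.numeric(scale(temp))          # centred and scaled copy of temp

glm_raw <- glm(visits ~ temp, family = quasipoisson(link = "log"))
lm_z    <- lm(log(visits) ~ temp_z)        # LM on the standardised predictor
lm_raw  <- lm(log(visits) ~ temp)          # LM on the raw predictor

# Percent change in the response per one-unit change in the predictor
pct <- function(b) 100 * (exp(b) - 1)

pct_glm_raw <- pct(coef(glm_raw)[["temp"]])   # per degree
pct_lm_z    <- pct(coef(lm_z)[["temp_z"]])    # per standard deviation of temp
pct_lm_raw  <- pct(coef(lm_raw)[["temp"]])    # per degree

# Converting the standardised coefficient back to a per-degree coefficient
b_back <- coef(lm_z)[["temp_z"]] / sd(temp)

print(c(glm_per_degree = pct_glm_raw,
        lm_per_sd      = pct_lm_z,
        lm_per_degree  = pct_lm_raw))
```

In this simulation the per-degree figures from the LM and the GLM are close, while the per-standard-deviation figure is much larger simply because one standard deviation spans several degrees. A Box-Cox-transformed predictor changes the meaning of the coefficient further, so the safest comparison is to refit both models on the same untransformed predictors.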

I think that, unless you have a good reason, using a heavy-handed transformation like the Box-Cox transformation on the predictor variables of your model is a bit redundant. This matter has also received some great answers in the past; check the thread "Box-Cox like transformation for independent variables?" for starters, I think you will find it enlightening too. As a basic rule, transforming the independent variables is usually done to account for non-linear relations; issues of heteroskedasticity, skewness of the fitted distribution, etc. are usually unaffected by transforming the independent variables.
