Dependent variable - number of visitors to a historical monument by day
Independent variables - Daily average temperature, relative humidity, number of tourists visiting the state by day, etc.
My task is to understand the key drivers that influence the number of visitors. So far, I have done the following:
1) Fitted a multiple regression in R, LN(Number of Visitors) ~ independent variables, using the lm() function. I also transformed some of the independent variables, as recommended by BoxCoxTrans() from the caret package. The resulting regression diagnostics look pretty decent to me. The R-squared was approximately 25%, which is satisfactory to me given the data that I have.
2) I have also tried fitting a model with the glm.nb() function from the MASS package, because a test for over-dispersion indicated that the dependent variable is over-dispersed. The resulting regression diagnostics look pretty decent to me.
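For reference, here is a minimal sketch of how an over-dispersion check of this kind can be done. It runs on simulated data, not my real data: the variable names `temp` and `visitors` and all simulation parameters are assumptions made up for illustration. It computes the Pearson-based dispersion estimate from a Poisson fit, shows that a quasi-Poisson fit reports the same quantity, and fits the negative binomial alternative.

```r
library(MASS)  # provides glm.nb()

# Simulated stand-in data (hypothetical names and parameters)
set.seed(1)
n <- 500
temp <- runif(n, 0, 30)
visitors <- rnbinom(n, size = 2, mu = exp(3 + 0.05 * temp))  # over-dispersed counts

# Plain Poisson fit and the Pearson-based dispersion estimate
pois_fit <- glm(visitors ~ temp, family = poisson())
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)
dispersion  # values well above 1 suggest over-dispersion

# A quasi-Poisson fit has the same coefficients and reports this
# estimate as its dispersion parameter
qp_fit <- glm(visitors ~ temp, family = quasipoisson())
summary(qp_fit)$dispersion

# Negative binomial alternative
nb_fit <- glm.nb(visitors ~ temp)
coef(nb_fit)
```

Quasi-Poisson keeps the Poisson point estimates and only inflates the standard errors, whereas the negative binomial model assumes a different mean-variance relationship, so its coefficients can differ somewhat.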
The residuals are reasonably well-behaved in both cases, given that this is real-world data. However, the two models give vastly different coefficient estimates, e.g., a 1-degree increase in temperature is associated with a 10% increase in the number of visitors per the regression model and a 30% increase per the GLM with a Poisson or quasi-Poisson distribution.
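To make explicit how a percentage effect can be read off a coefficient: with a log link (or a log-transformed response), a coefficient b corresponds to a change of 100 * (exp(b) - 1) percent per one-unit increase in the predictor, provided the predictor enters the model on its original scale. The sketch below applies this to the Temp coefficient from the quasi-Poisson output and then compares lm() on the log response with a quasi-Poisson GLM on simulated data. The helper `pct_change`, the simulated variables, and the `+ 1` inside the log are assumptions for illustration, not what I actually ran.

```r
# Percent change in the response per one-unit increase in a predictor,
# for a log link or a log-transformed response (hypothetical helper)
pct_change <- function(beta) 100 * (exp(beta) - 1)

pct_change(0.2362620)  # Temp coefficient from the quasi-Poisson output, about 26.7

# Simulated stand-in data (hypothetical names and parameters)
set.seed(42)
n <- 1000
temp <- runif(n, 0, 30)
visitors <- rnbinom(n, size = 2, mu = exp(3 + 0.05 * temp))

# Both models use the same untransformed predictor, so the slopes are comparable
lm_fit <- lm(log(visitors + 1) ~ temp)  # + 1 only guards against log(0)
qp_fit <- glm(visitors ~ temp, family = quasipoisson())

pct_change(coef(lm_fit)["temp"])
pct_change(coef(qp_fit)["temp"])
```

The two approaches model slightly different things (the mean of the log versus the log of the mean), but on the same predictor scale they usually give similar slopes. In my outputs the lm() was fitted to `data = transformed` and the glm() to `data = tp`, so the predictors may not be on the same scale in the two fits.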
I would like to check with the community to make sure that I am not using an inappropriate technique for the type of data I have, and to learn which of the two techniques is better suited to it. Thank you!
Output from the lm() is as follows:
Call:
lm(formula = CT ~ Review + MinTemp + RH + Delta + xRate + PercIntl + PercOnline + CSI, data = transformed)
Residuals:
     Min       1Q   Median       3Q      Max
-2.31465 -0.57769 -0.03228  0.56113  2.96008
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.150803   0.009828 -15.344  < 2e-16 ***
Review       0.103383   0.009788  10.562  < 2e-16 ***
MinTemp     -0.275583   0.012636 -21.809  < 2e-16 ***
RH           0.190549   0.011313  16.844  < 2e-16 ***
DeltaMax     0.030461   0.010626   2.867  0.00416 **
xRate        0.181127   0.013951  12.983  < 2e-16 ***
PercIntl     0.318809   0.010610  30.049  < 2e-16 ***
PercOnline  -0.212168   0.011827 -17.939  < 2e-16 ***
CCI         -0.080672   0.011022  -7.319 2.79e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8085 on 6855 degrees of freedom
(1847 observations deleted due to missingness)
Multiple R-squared: 0.2495, Adjusted R-squared: 0.2486
F-statistic: 284.8 on 8 and 6855 DF, p-value: < 2.2e-16
Output from the glm() is as follows:
Call:
glm(formula = CT ~ Reviews + Delta + RH + xRate +
PercOnline + PercIntl + CSI + Temp, family = quasipoisson(),
data = tp, subset = !selector)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-46.404  -12.484   -5.329    5.063  100.196
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.7583433  1.0223177  -3.676 0.000239 ***
Review      -0.0201375  0.0010352 -19.453  < 2e-16 ***
DeltaMax     0.0063672  0.0015213   4.185 2.89e-05 ***
RH           0.0019643  0.0006838   2.873 0.004083 **
xRate        0.1009589  0.0082975  12.167  < 2e-16 ***
PercOnline  -0.0233884  0.0012857 -18.192  < 2e-16 ***
PercIntl     0.0148912  0.0011250  13.236  < 2e-16 ***
CSI         -0.0068745  0.0009345  -7.356 2.13e-13 ***
Temp         0.2362620  0.0149763  20.428 1.39e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 312.9381)
Null deviance: 2145768 on 6303 degrees of freedom
Residual deviance: 1591885 on 6295 degrees of freedom
(331 observations deleted due to missingness)
AIC: 67908
Number of Fisher Scoring iterations: 5
Analysis of Deviance Table (Type II tests)
Response: CT
           LR Chisq Df Pr(>Chisq)
Review       372.02  1  < 2.2e-16 ***
Delta         17.47  1  2.912e-05 ***
RH             8.24  1   0.004103 **
xRate        150.20  1  < 2.2e-16 ***
PercOnline   337.02  1  < 2.2e-16 ***
PercIntl     163.27  1  < 2.2e-16 ***
CSI           53.68  1  2.362e-13 ***
Temp          42.02  1  9.052e-11 ***
glm.nb() you meant the GLM quasi-Poisson distribution? (I would expect glm(family='quasipoisson', ...)) Both account for over-dispersion; I am just uncertain which one you used and which one you refer to. – usεr11852 Feb 05 '16 at 18:47
lm output so we can compare... :) In addition: 1. Can you check the models' fit? 2. You mention that you Box-Coxed some variables; are these variables used in both models? 3. What kind of temperature scale is this? Have you centred/scaled any of the predictors? – usεr11852 Feb 06 '16 at 01:19