
I'm fitting three linear mixed-effects models of crime. The first models total crime, so there are ~96,000 observations. In the second model, I look at crime as a function of crime type: the DV is a count, I include a categorical variable (property/violent), so I have twice the number of observations (192,000), along with covariates (population, population density, per capita income, education spending). In the third model I add an interaction between crime type and the covariates (crime type * (population + population density + per capita income + ...), also 192,000 observations), in order to look at the "effects" of the covariates on the different crime types.

In the first model I have an RMSE of .077; in the second, .17; and in the third, .193. I had expected the RMSE to decrease once crime type was included, on the assumption that the model would then predict crime better; however, that isn't the case. Is it simply that the first model is the best predictive model? Also, I was wondering whether AICs are comparable when one model has exactly twice the number of observations, or is there still no way to compare AICs across models with different numbers of observations?

This is the function I'm using to compute RMSE:

RMSE = function(m, o, n){
  # m: model fitted values, o: observed values, n: number of observations
  # returns the root-mean-square error of m vs. o, divided by n
  sqrt(mean((m - o)^2)) / n
}
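
In case it helps, here is a minimal, self-contained sketch of how I apply it to a fitted lmer model, using lme4's built-in sleepstudy data as a stand-in for my crime data (the object names are only for illustration):

library(lme4)

# sleepstudy stands in for the crime data; 'fit' plays the role of one of the models below
fit  <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy, REML = FALSE)
obs  <- getME(fit, "y")        # response vector actually used in the fit
pred <- fitted(fit)            # conditional fitted values (random effects included)
RMSE(pred, obs, length(obs))   # note: RMSE() as defined above divides the usual RMSE by n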

The first model is:

lmer(log(CRIME_TOTAL + 1) ~ cent.log.pop + cent.log.pop.dens + per_capita_income +
       cent.EXP_STUDENT + white + diff.dem + black + median_gross_rent + ba +
       prop5.17.pov + asian + hs + officers + no.grad.hs +
       (year|PLACE_ID) + (1|COUNTY_ID) + (1|STATE),
     control = lmerControl(optimizer = "nloptwrap", calc.derivs = FALSE),
     na.action = 'na.omit', REML = FALSE, city)

First model output

The second model is:

lmer(log(COUNT + 1) ~ CRIME*cent.EXP_STUDENT + cent.log.pop + cent.log.pop.dens +
       per_capita_income + white + diff.dem + black + median_gross_rent + ba +
       prop5.17.pov + asian + hs + officers + no.grad.hs +
       (year|PLACE_ID) + (1|COUNTY_ID) + (1|STATE),
     control = lmerControl(optimizer = "nloptwrap", calc.derivs = FALSE),
     na.action = 'na.omit', REML = FALSE, city.v.p)

Second model output

And the third model is:

lmer(log(COUNT + 1) ~ CRIME*(cent.log.pop + cent.log.pop.dens + white + black +
       per_capita_income + no.grad.hs + prop5.17.pov + ma.plus + hs + ba +
       median_gross_rent + cent.EXP_STUDENT + unemployment_rate + asian + diff.dem) +
       officers + (year|PLACE_ID) + (1|COUNTY_ID) + (1|STATE),
     control = lmerControl(optimizer = "nloptwrap", calc.derivs = FALSE),
     na.action = 'na.omit', REML = FALSE, city.v.p)

Third model output
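
As for the AIC question, this is the kind of comparison I have in mind, again sketched with the sleepstudy stand-in since all three crime models above are fitted with REML = FALSE (so their AICs come from the ML log-likelihoods); fit0 plays the role of the simpler model and fit1 the richer one:

library(lme4)

# Both fits use ML (REML = FALSE), matching the crime models above
fit0 <- lmer(Reaction ~ Days + (1 | Subject),    data = sleepstudy, REML = FALSE)
fit1 <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy, REML = FALSE)

AIC(fit0, fit1)     # here both models see the same data, so the same n
anova(fit0, fit1)   # likelihood-ratio test for the nested pair
# In my case, models 2 and 3 are fitted to 192,000 rows while model 1 uses ~96,000,
# which is what makes me unsure whether their AICs are comparable.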

Thanks!

James
  • So the DV is the same for each model? Could you include the model code and summary output for each one, please? – Robert Long Jul 30 '20 at 03:20
  • Well, yes and no... the DV is always crime, but in the latter two models the counts are broken out so that they represent property and violent crime separately. – James Jul 30 '20 at 18:40
  • Also, let me know if I can provide any additional information! – James Jul 31 '20 at 21:28
  • You have count data, and counts are an extensive variable, so a log link function is indicated ... have a look at https://stats.stackexchange.com/questions/142338/goodness-of-fit-and-which-model-to-choose-linear-regression-or-poisson/142353#142353 – kjetil b halvorsen Aug 10 '20 at 20:34
  • Thanks, Kjetil. I began with a Poisson regression (a generalized linear mixed model). Those can be a bit more difficult to fit, and with the advice of Doug Bates and some others on the sig forum, I switched to the model given here. – James Aug 11 '20 at 21:47
