Using RMSE and AIC to compare three separate "final" models (one with double observations)?

Question

I'm comparing three linear mixed-effects models of crime. The first models total crime, so there are ~96,000 observations. In the second, I model crime as a function of crime type: the DV is the count, I include a categorical variable (property/violent), and I therefore have twice the number of observations (192,000), along with covariates (population, population density, per capita income, education spending). In the third, also with 192,000 observations, I include an interaction between crime type and the covariates (crime * (population + population density + per capita income + ...)) in order to look at the "effects" of the covariates at each level of crime type.

In the first model, I have an RMSE of 0.077; in the second, an RMSE of 0.17; and in the third, an RMSE of 0.193. I expected the RMSE to decrease once crime type is included, on the assumption that the model would then predict crime better; however, that isn't the case. Is it just that the first model is the best predictive model? Also, are AICs comparable if one model has exactly twice as many observations as another, or is there still no way to compare the AICs of models fit to different numbers of observations?
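To make the AIC part of the question concrete, here is a minimal sketch showing how the log-likelihood (and hence AIC) depends on the number of observations. It uses simulated data and plain `lm()` rather than the crime data or `lmer()`, and simply stacks the same rows twice, which is an assumption made for illustration only and is not the same as splitting total crime into two types. All variable names are hypothetical.

```r
set.seed(42)

# Simulated data (hypothetical; not the crime data from the question).
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
d_single <- data.frame(x = x, y = y)
d_double <- rbind(d_single, d_single)  # the same rows stacked twice

fit_single <- lm(y ~ x, data = d_single)
fit_double <- lm(y ~ x, data = d_double)

# The log-likelihood is a sum over observations, so stacking the data
# doubles it even though the fitted coefficients are identical.
ll_single <- as.numeric(logLik(fit_single))
ll_double <- as.numeric(logLik(fit_double))

print(c(nobs(fit_single), nobs(fit_double)))
print(c(ll_single, ll_double))
print(c(AIC(fit_single), AIC(fit_double)))
```

In this sketch the two fits describe the data equally well, yet their AIC values differ only because the number of observations differs, so the raw AIC values cannot be read as a ranking of the two fits.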

This is the function I'm using to compute RMSE:

RMSE = function(m, o,n){
  (sqrt(mean((m - o)^2)))/n
}
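For comparison, a minimal sketch of the conventional RMSE, which does not divide by an extra argument, together with a breakdown by crime type. The vectors are simulated and the names `rmse`, `crime_type`, `observed`, and `fitted_vals` are hypothetical; with a fitted `lmer` model the fitted values would typically come from `fitted()` on the model object instead.

```r
# Conventional RMSE: square root of the mean squared difference.
rmse <- function(fitted, observed) {
  sqrt(mean((fitted - observed)^2))
}

set.seed(1)

# Hypothetical stacked data: two crime types with different error sizes.
crime_type <- rep(c("property", "violent"), each = 500)
observed <- rnorm(1000, mean = 5)
noise_sd <- ifelse(crime_type == "property", 0.2, 0.6)
fitted_vals <- observed + rnorm(1000, sd = noise_sd)

# RMSE pooled over both crime types.
pooled <- rmse(fitted_vals, observed)

# RMSE computed separately within each crime type.
by_type <- sapply(split(seq_along(observed), crime_type), function(idx) {
  rmse(fitted_vals[idx], observed[idx])
})

print(pooled)
print(by_type)
```

With equal group sizes, the squared pooled RMSE is the average of the squared per-type RMSEs, so a pooled value can be dominated by whichever crime type is harder to predict.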

The first model is:

lmer(log(CRIME_TOTAL + 1) ~ cent.log.pop + cent.log.pop.dens + per_capita_income +
       cent.EXP_STUDENT + white + diff.dem + black + median_gross_rent + ba +
       prop5.17.pov + asian + hs + officers + no.grad.hs +
       (year|PLACE_ID) + (1|COUNTY_ID) + (1|STATE),
     data = city,
     control = lmerControl(optimizer = "nloptwrap", calc.derivs = FALSE),
     na.action = 'na.omit', REML = FALSE)

The second model is:

lmer(log(COUNT + 1) ~ CRIME*cent.EXP_STUDENT + cent.log.pop + cent.log.pop.dens +
       per_capita_income + white + diff.dem + black + median_gross_rent + ba +
       prop5.17.pov + asian + hs + officers + no.grad.hs +
       (year|PLACE_ID) + (1|COUNTY_ID) + (1|STATE),
     data = city.v.p,
     control = lmerControl(optimizer = "nloptwrap", calc.derivs = FALSE),
     na.action = 'na.omit', REML = FALSE)

And the third model is:

lmer(log(COUNT + 1) ~ CRIME*(cent.log.pop + cent.log.pop.dens + white + black +
         per_capita_income + no.grad.hs + prop5.17.pov + ma.plus + hs + ba +
         median_gross_rent + cent.EXP_STUDENT + unemployment_rate + asian + diff.dem) +
       officers + (year|PLACE_ID) + (1|COUNTY_ID) + (1|STATE),
     data = city.v.p,
     control = lmerControl(optimizer = "nloptwrap", calc.derivs = FALSE),
     na.action = 'na.omit', REML = FALSE)
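The third model includes ma.plus and unemployment_rate, which do not appear in the second, so with na.action = 'na.omit' the two fits may not use the same rows even though both take city.v.p as the data. A minimal sketch of how one might check this, using a toy data frame and `lm()` as stand-ins (the data and the names `d`, `fit_a`, `fit_b`, and `d_cc` are hypothetical):

```r
# Toy data: x2 has missing values, x1 does not.
d <- data.frame(
  y  = c(1.2, 2.3, 2.9, 4.1, 5.2, 5.8),
  x1 = 1:6,
  x2 = c(0.5, NA, 1.5, 1.0, NA, 2.5)
)

# With na.omit, each model silently drops the rows it cannot use.
fit_a <- lm(y ~ x1,      data = d, na.action = na.omit)
fit_b <- lm(y ~ x1 + x2, data = d, na.action = na.omit)

print(nobs(fit_a))  # 6 rows used
print(nobs(fit_b))  # 4 rows used

# Refit the smaller model on the rows that are complete for all variables,
# so both models see exactly the same observations.
d_cc <- d[complete.cases(d[, c("y", "x1", "x2")]), ]
fit_a_cc <- lm(y ~ x1, data = d_cc)

print(nobs(fit_a_cc) == nobs(fit_b))
print(c(AIC(fit_a_cc), AIC(fit_b)))
```

Comparing the number of observations used by each fit is a quick way to confirm that two models were estimated on the same rows before comparing their AICs.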

Thanks!

So the DV is the same for each model? Could you include the model code and summary output for each one, please? - Robert Long, Jul 30 '20 at 03:20
Well, yes and no... the DV is always crime, but in the latter two models crime is gathered such that it represents property and violent crime. - James, Jul 30 '20 at 18:40
Also, let me know if I can provide any additional information! - James, Jul 31 '20 at 21:28
You have count data, and a count is an extensive variable, so a log link function is indicated... Have a look at https://stats.stackexchange.com/questions/142338/goodness-of-fit-and-which-model-to-choose-linear-regression-or-poisson/142353#142353 - kjetil b halvorsen, Aug 10 '20 at 20:34
Thanks, Kjetil. I began with a Poisson regression using GLME. GLME can be a bit more difficult to fit, and on the advice of Doug Bates and some others on the sig forum, I switched to the given model. - James, Aug 11 '20 at 21:47
