I have an over-dispersed count dataset and I want to add an offset to my negative binomial on the RHS to create a rate of events for y (see this great answer for further explanation). Next, I want to create a predicted plot of my results.
However, after following examples from smarter folks than myself, I get strange fitted values from the predict() function depending on where I put the offset on in my model.
Here's my example:
library(MASS)
y <- c(0,0,20,28,20,24,8,8,4,9,0,4,0,35,3,8,4,5,7,10,19,21,19,2,0,12,32,5,1,1,1,16,1,13,7,6,7,1,7,6,8,9,13,37,17,2,2,1,5,1,2,6,8,9,3,15,6,1,7,0,1,0,11,0,4,12,3,36,2,0,1,8,4,3,25,12,4,2,20,2,24,5,25,2,8,1,1,6,29,50,14,12,7,40,1,0,1,0,1,12,0,1,0,1,1,4,3,1,13,6,2,4,19,3,8,4,5,10,46,4,1,45,4,4,4,2,9,8,43,3,1,22,5,10,1,4,0,17,4,5,3,3,1,28,32,1,9,1,4,27,7,9,23,1,4,19,1,1,2,3,2,6,11,2,54,7,19,22,23,18,27,4,2,3,2,16,5,19,16,8,9,2,8,2,18,2,6,16,13,6,4,7,40,6,16,5,12,17,0,1,6,40,93,16,6,19,6,8,31,1,4,0,17,3,6,13,23,9,12,7,8,2,14,16,15,28,3,20,35,16,5,7,2,51,6,1,17,2,13,19,13,18,1,5,13,7,0,10,11,21,15,0,31,17,8,2,5,1,3,59,9,3,9,1,2,0,35,5,6,2,2,7,9,34,1,32,8,2,9,0,2,0,4,33,36,13,44,4,17,2,0,0,9,15,14,19,1,11,16,80,8,29,3,18,20,7,6,10,14,54,9,25,4,22,9,16,2,7,7,41,11,4,32,4,16,7,6,51,7,9,8,7,19,82,4,43,3,7,17,53,24,12,23,82,5,10,16,57,9,8,3,16,20,37,4,29,15,11,13,16,4,54,0,8,10,28,53,7,6,23,0,48,11,21,38,4,6,18,13,9,6,17,17,15,83,15,3,17,8,44,6,43,13,67,47,14,11)
x <- c(-35.93,-35.81,-33.93,-33.56,-32.87,-32.11,-31.47,-31.29,-31.27,-30.76,-30.54,-29.65,-29.39,-29.06,-28.99,-28.50,-28.16,-26.72,-25.60,-24.83,-24.29,-24.24,-23.07,-22.92,-22.72,-22.68,-22.53,-22.42,-22.25,-22.08,-21.86,-21.61,-21.45,-21.35,-21.33,-21.16,-20.70,-20.52,-20.35,-20.27,-20.18,-20.14,-19.96,-19.89,-19.89,-19.01,-18.83,-18.68,-18.54,-18.51,-18.16,-17.96,-17.91,-17.80,-17.67,-17.54,-17.35,-17.23,-16.92,-16.79,-16.74,-16.63,-16.59,-16.59,-16.33,-16.33,-16.20,-15.92,-15.61,-15.51,-15.32,-15.30,-15.29,-15.24,-15.09,-15.05,-14.96,-14.69,-14.56,-13.92,-13.82,-13.79,-13.79,-13.71,-13.65,-13.51,-12.90,-12.73,-12.66,-12.51,-12.48,-12.36,-12.32,-12.30,-12.19,-12.18,-12.17,-11.87,-11.82,-11.60,-11.52,-11.36,-11.04,-10.97,-10.91,-10.75,-10.75,-10.52,-10.45,-10.44,-10.22,-10.19,-10.15,-9.95,-9.92,-9.68,-9.59,-9.18,-8.92,-8.68,-8.33,-8.28,-8.25,-8.02,-8.00,-7.98,-7.97,-7.87,-7.84,-7.68,-7.55,-7.50,-7.25,-7.14,-6.96,-6.74,-6.34,-6.33,-6.10,-6.03,-5.86,-5.83,-5.81,-5.81,-5.79,-5.76,-5.67,-5.35,-5.27,-5.21,-5.19,-5.15,-4.98,-4.77,-4.61,-4.30,-4.20,-4.07,-4.05,-4.00,-3.97,-3.94,-3.82,-3.75,-3.75,-3.65,-3.50,-3.44,-3.27,-3.17,-3.16,-2.77,-2.58,-2.36,-2.34,-2.30,-2.30,-2.29,-1.92,-1.86,-1.82,-1.75,-1.66,-1.56,-1.52,-1.45,-1.23,-1.22,-1.15,-0.89,-0.71,-0.70,-0.63,-0.56,-0.48,-0.47,-0.37,-0.31, 0.00, 0.02, 0.06, 0.07, 0.08, 0.12, 0.12, 0.21, 0.22, 0.41, 0.57, 0.59, 1.05, 1.14, 1.19, 1.33, 1.36, 1.40, 1.42, 1.43, 1.43, 1.68, 1.68, 1.80, 1.88, 1.89, 1.90, 1.99, 2.09, 2.09, 2.14, 2.27, 2.27, 2.32, 2.40, 2.64, 2.67, 2.78, 3.01, 3.02, 3.23, 3.25, 3.33, 3.40, 3.46, 3.47, 3.58, 3.62, 3.64, 3.66, 3.93, 4.02, 4.08, 4.19, 4.37, 4.53, 4.56, 4.61, 4.61, 4.69, 5.05, 5.14, 5.21, 5.58, 5.69, 5.70, 5.94, 6.12, 6.32, 6.33, 6.44, 6.50, 6.66, 6.83, 6.96, 7.17, 7.23, 7.41, 7.60, 7.62, 7.82, 7.86, 7.91, 7.91, 7.96, 8.09, 8.20, 8.31, 8.54, 8.55, 8.60, 8.66, 8.77, 8.84, 8.87, 8.92, 8.97, 9.13, 9.23, 9.34, 9.59, 9.75, 10.23, 10.26, 10.30, 10.32, 10.60, 10.66, 10.79, 10.94, 10.95, 11.31, 11.38, 11.49, 11.76, 11.82, 12.57, 12.61, 12.68, 12.80, 13.05, 13.10, 13.34, 13.52, 13.93, 14.04, 14.04, 14.67, 14.68, 14.72, 14.79, 14.97, 15.20, 15.53, 15.63, 15.92, 15.92, 15.99, 16.00, 16.12, 17.23, 17.35, 17.55, 17.61, 17.75, 17.77, 17.78, 17.83, 18.05, 18.09, 18.17, 18.20, 18.45, 18.61, 18.73, 18.96, 19.10, 19.53, 19.58, 19.65, 20.26, 20.52, 20.66, 20.96, 21.04, 21.05, 21.06, 21.36, 22.33, 22.54, 22.57, 22.94, 24.29, 24.38, 24.58, 24.58, 24.83, 24.95, 25.51, 25.70, 26.35, 27.13, 27.37, 28.29, 28.80, 28.93, 29.26, 29.29, 29.82, 31.25, 31.41, 32.77, 32.85, 33.36, 34.01, 34.60, 36.67, 37.56, 42.89)
z <- c(237,155,2523,960,2907,1972,531,353,238,1402,155,201,91,882,189,919,63,1322,447,384,2876,2025,778,347,248,1325,1658,419,247,193,138,1278,130,420,541,1341,935,111,1256,283,727,369,547,3322,719,253,372,481,949,390,247,789,1529,1086,594,735,479,186,828,342,1004,26,512,122,567,1270,269,959,214,94,303,302,349,129,1825,1157,882,740,284,358,894,296,449,1753,792,75,155,804,2206,4251,238,718,777,557,86,536,201,202,230,1105,451,211,667,92,603,175,304,110,429,408,531,289,174,349,320,1345,1320,785,1762,346,326,1222,300,309,591,380,907,649,840,212,318,854,463,737,138,503,264,799,416,808,323,530,822,2326,2484,86,628,347,342,640,386,1460,837,27,169,1147,312,1003,211,576,458,161,2649,1021,739,360,1111,1132,669,501,1182,174,124,653,1247,829,1008,1122,916,356,570,1387,712,211,993,54,347,1501,837,1695,157,767,909,1576,780,629,1213,1087,189,375,570,1393,843,2122,852,1141,1507,304,658,131,898,132,1123,136,1663,614,1318,710,343,639,954,217,652,755,813,1085,368,778,2715,923,1822,1081,130,1532,813,66,1563,661,1567,1056,473,891,175,472,696,1251,643,540,2693,973,959,103,952,595,408,110,486,400,1183,2324,607,353,300,246,298,222,1367,975,578,433,2344,403,3908,1640,750,1268,520,346,396,960,345,130,252,675,840,874,1029,237,1655,689,530,241,467,488,751,734,185,461,566,1372,1497,1216,848,1389,2170,423,1753,953,600,1267,1168,2950,163,1743,1804,2074,651,610,2775,2319,3970,469,874,626,1001,443,279,1580,665,783,1787,743,1462,1686,203,1375,1062,1742,1200,950,1674,871,853,1449,720,280,1670,1392,1938,639,570,1744,761,2645,126,1571,1293,1062,1524,1143,522,2098,363,553,1028,1652,1618,887,1521,1004,210,1175,366,756,1167,1137,243,1483,3727,611,1493,656,378,1498,2269,822,802,1655,1595,1515,979,1584,2448,1998,1525,602,2592)
data <- data.frame(y,x,z)
In the first model, I omit the offset, just for a baseline check. But this defeats the purpose of creating a rate interpretation:
m1 <- glm.nb(y ~ x, data=data)
The problem arises in the second model when I add the offset(log(z)) inside the formula.
m2 <- glm.nb(y ~ x + offset(log(z)),data=data)
In fumbling around, I also moved the offset outside of the formula, but this converts the offset to a weights function across the whole model, where the z is now weighting across both y and x:
m3 <- glm.nb(y ~ x, offset(log(z)),data=data)
When I run the predict() function on these 3 models
pred <- cbind(data, "m1" = predict(m1, type="link", se.fit=TRUE)[1:2],
"m2" = predict(m2, type="link", se.fit=TRUE)[1:2],
"m3" = predict(m3, type="link", se.fit=TRUE)[1:2])
head(pred,10)
y x z m1.fit m1.se.fit m2.fit m2.se.fit m3.fit m3.se.fit
1 0 -35.93 237 1.860281 0.1347283 1.076164 0.11099828 1.983966 0.05165139
2 0 -35.81 155 1.862478 0.1343475 0.652252 0.11068868 1.985993 0.05150751
3 20 -33.93 2523 1.896910 0.1284076 3.453355 0.10585949 2.017758 0.04926325
4 28 -33.56 960 1.903686 0.1272450 2.489313 0.10491404 2.024009 0.04882388
5 20 -32.87 2907 1.916324 0.1250827 3.601413 0.10315558 2.035667 0.04800671
6 24 -32.11 1972 1.930243 0.1227105 3.217918 0.10122610 2.048508 0.04711008
7 8 -31.47 531 1.941964 0.1207209 1.909731 0.09960761 2.059322 0.04635798
8 8 -31.29 353 1.945261 0.1201627 1.502521 0.09915349 2.062363 0.04614696
9 4 -31.27 238 1.945627 0.1201008 1.108444 0.09910306 2.062701 0.04612352
10 9 -30.76 1402 1.954968 0.1185229 2.884901 0.09781922 2.071318 0.04552694
...
m1.fit and m3.fit have nice orderly fitted values, which plot nicely, while m2.fit is kind of random. In calculating the coefficients from m2.fit by hand, I don't get the same values at the given value of x either. I've spent many hours going over this problem and the data, simplifying down to this example. I'm not sure what's going on here. I have about 25 variables that I want to fit and this problem arises in all of them.
Question:
Can someone help me understand why model 2 is producing unordered fitted values when the formula includes the offset y ~ x + offset(log(z))? Are these not the predicted logged values of the rate of $log(y/z) = x$? I'm open to a better way of doing this too if you have any suggestions.
Thanks in advance for the answer!
z = total rentersin a given tract andy = evictionsfor that tract. This study is examining the effects of neighborhood change on evictions wherexis the median centered value for the given variable. Thank you! – Tim Jun 26 '17 at 07:10