Predict function on negative binomial produces strange fitted values when adding an offset

Question

I have an over-dispersed count dataset and I want to add an offset to my negative binomial on the RHS to create a rate of events for y (see this great answer for further explanation). Next, I want to create a predicted plot of my results.

However, after following examples from smarter folks than myself, I get strange fitted values from the predict() function depending on where I put the offset on in my model.

Here's my example:

library(MASS)

y <- c(0,0,20,28,20,24,8,8,4,9,0,4,0,35,3,8,4,5,7,10,19,21,19,2,0,12,32,5,1,1,1,16,1,13,7,6,7,1,7,6,8,9,13,37,17,2,2,1,5,1,2,6,8,9,3,15,6,1,7,0,1,0,11,0,4,12,3,36,2,0,1,8,4,3,25,12,4,2,20,2,24,5,25,2,8,1,1,6,29,50,14,12,7,40,1,0,1,0,1,12,0,1,0,1,1,4,3,1,13,6,2,4,19,3,8,4,5,10,46,4,1,45,4,4,4,2,9,8,43,3,1,22,5,10,1,4,0,17,4,5,3,3,1,28,32,1,9,1,4,27,7,9,23,1,4,19,1,1,2,3,2,6,11,2,54,7,19,22,23,18,27,4,2,3,2,16,5,19,16,8,9,2,8,2,18,2,6,16,13,6,4,7,40,6,16,5,12,17,0,1,6,40,93,16,6,19,6,8,31,1,4,0,17,3,6,13,23,9,12,7,8,2,14,16,15,28,3,20,35,16,5,7,2,51,6,1,17,2,13,19,13,18,1,5,13,7,0,10,11,21,15,0,31,17,8,2,5,1,3,59,9,3,9,1,2,0,35,5,6,2,2,7,9,34,1,32,8,2,9,0,2,0,4,33,36,13,44,4,17,2,0,0,9,15,14,19,1,11,16,80,8,29,3,18,20,7,6,10,14,54,9,25,4,22,9,16,2,7,7,41,11,4,32,4,16,7,6,51,7,9,8,7,19,82,4,43,3,7,17,53,24,12,23,82,5,10,16,57,9,8,3,16,20,37,4,29,15,11,13,16,4,54,0,8,10,28,53,7,6,23,0,48,11,21,38,4,6,18,13,9,6,17,17,15,83,15,3,17,8,44,6,43,13,67,47,14,11)
x <- c(-35.93,-35.81,-33.93,-33.56,-32.87,-32.11,-31.47,-31.29,-31.27,-30.76,-30.54,-29.65,-29.39,-29.06,-28.99,-28.50,-28.16,-26.72,-25.60,-24.83,-24.29,-24.24,-23.07,-22.92,-22.72,-22.68,-22.53,-22.42,-22.25,-22.08,-21.86,-21.61,-21.45,-21.35,-21.33,-21.16,-20.70,-20.52,-20.35,-20.27,-20.18,-20.14,-19.96,-19.89,-19.89,-19.01,-18.83,-18.68,-18.54,-18.51,-18.16,-17.96,-17.91,-17.80,-17.67,-17.54,-17.35,-17.23,-16.92,-16.79,-16.74,-16.63,-16.59,-16.59,-16.33,-16.33,-16.20,-15.92,-15.61,-15.51,-15.32,-15.30,-15.29,-15.24,-15.09,-15.05,-14.96,-14.69,-14.56,-13.92,-13.82,-13.79,-13.79,-13.71,-13.65,-13.51,-12.90,-12.73,-12.66,-12.51,-12.48,-12.36,-12.32,-12.30,-12.19,-12.18,-12.17,-11.87,-11.82,-11.60,-11.52,-11.36,-11.04,-10.97,-10.91,-10.75,-10.75,-10.52,-10.45,-10.44,-10.22,-10.19,-10.15,-9.95,-9.92,-9.68,-9.59,-9.18,-8.92,-8.68,-8.33,-8.28,-8.25,-8.02,-8.00,-7.98,-7.97,-7.87,-7.84,-7.68,-7.55,-7.50,-7.25,-7.14,-6.96,-6.74,-6.34,-6.33,-6.10,-6.03,-5.86,-5.83,-5.81,-5.81,-5.79,-5.76,-5.67,-5.35,-5.27,-5.21,-5.19,-5.15,-4.98,-4.77,-4.61,-4.30,-4.20,-4.07,-4.05,-4.00,-3.97,-3.94,-3.82,-3.75,-3.75,-3.65,-3.50,-3.44,-3.27,-3.17,-3.16,-2.77,-2.58,-2.36,-2.34,-2.30,-2.30,-2.29,-1.92,-1.86,-1.82,-1.75,-1.66,-1.56,-1.52,-1.45,-1.23,-1.22,-1.15,-0.89,-0.71,-0.70,-0.63,-0.56,-0.48,-0.47,-0.37,-0.31, 0.00, 0.02, 0.06, 0.07, 0.08, 0.12, 0.12, 0.21, 0.22, 0.41, 0.57, 0.59, 1.05, 1.14, 1.19, 1.33, 1.36, 1.40, 1.42, 1.43, 1.43, 1.68, 1.68, 1.80, 1.88, 1.89, 1.90, 1.99, 2.09, 2.09, 2.14, 2.27, 2.27, 2.32, 2.40, 2.64, 2.67, 2.78, 3.01, 3.02, 3.23, 3.25, 3.33, 3.40, 3.46, 3.47, 3.58, 3.62, 3.64, 3.66, 3.93, 4.02, 4.08, 4.19, 4.37, 4.53, 4.56, 4.61, 4.61, 4.69, 5.05, 5.14, 5.21, 5.58, 5.69, 5.70, 5.94, 6.12, 6.32, 6.33, 6.44, 6.50, 6.66, 6.83, 6.96, 7.17, 7.23, 7.41, 7.60, 7.62, 7.82, 7.86, 7.91, 7.91, 7.96, 8.09, 8.20, 8.31, 8.54, 8.55, 8.60, 8.66, 8.77, 8.84, 8.87, 8.92, 8.97, 9.13, 9.23, 9.34, 9.59, 9.75, 10.23, 10.26, 10.30, 10.32, 10.60, 10.66, 10.79, 10.94, 10.95, 11.31, 11.38, 11.49, 11.76, 11.82, 12.57, 12.61, 12.68, 12.80, 13.05, 13.10, 13.34, 13.52, 13.93, 14.04, 14.04, 14.67, 14.68, 14.72, 14.79, 14.97, 15.20, 15.53, 15.63, 15.92, 15.92, 15.99, 16.00, 16.12, 17.23, 17.35, 17.55, 17.61, 17.75, 17.77, 17.78, 17.83, 18.05, 18.09, 18.17, 18.20, 18.45, 18.61, 18.73, 18.96, 19.10, 19.53, 19.58, 19.65, 20.26, 20.52, 20.66, 20.96, 21.04, 21.05, 21.06, 21.36, 22.33, 22.54, 22.57, 22.94, 24.29, 24.38, 24.58, 24.58, 24.83, 24.95, 25.51, 25.70, 26.35, 27.13, 27.37, 28.29, 28.80, 28.93, 29.26, 29.29, 29.82, 31.25, 31.41, 32.77, 32.85, 33.36, 34.01, 34.60, 36.67, 37.56, 42.89)
z <- c(237,155,2523,960,2907,1972,531,353,238,1402,155,201,91,882,189,919,63,1322,447,384,2876,2025,778,347,248,1325,1658,419,247,193,138,1278,130,420,541,1341,935,111,1256,283,727,369,547,3322,719,253,372,481,949,390,247,789,1529,1086,594,735,479,186,828,342,1004,26,512,122,567,1270,269,959,214,94,303,302,349,129,1825,1157,882,740,284,358,894,296,449,1753,792,75,155,804,2206,4251,238,718,777,557,86,536,201,202,230,1105,451,211,667,92,603,175,304,110,429,408,531,289,174,349,320,1345,1320,785,1762,346,326,1222,300,309,591,380,907,649,840,212,318,854,463,737,138,503,264,799,416,808,323,530,822,2326,2484,86,628,347,342,640,386,1460,837,27,169,1147,312,1003,211,576,458,161,2649,1021,739,360,1111,1132,669,501,1182,174,124,653,1247,829,1008,1122,916,356,570,1387,712,211,993,54,347,1501,837,1695,157,767,909,1576,780,629,1213,1087,189,375,570,1393,843,2122,852,1141,1507,304,658,131,898,132,1123,136,1663,614,1318,710,343,639,954,217,652,755,813,1085,368,778,2715,923,1822,1081,130,1532,813,66,1563,661,1567,1056,473,891,175,472,696,1251,643,540,2693,973,959,103,952,595,408,110,486,400,1183,2324,607,353,300,246,298,222,1367,975,578,433,2344,403,3908,1640,750,1268,520,346,396,960,345,130,252,675,840,874,1029,237,1655,689,530,241,467,488,751,734,185,461,566,1372,1497,1216,848,1389,2170,423,1753,953,600,1267,1168,2950,163,1743,1804,2074,651,610,2775,2319,3970,469,874,626,1001,443,279,1580,665,783,1787,743,1462,1686,203,1375,1062,1742,1200,950,1674,871,853,1449,720,280,1670,1392,1938,639,570,1744,761,2645,126,1571,1293,1062,1524,1143,522,2098,363,553,1028,1652,1618,887,1521,1004,210,1175,366,756,1167,1137,243,1483,3727,611,1493,656,378,1498,2269,822,802,1655,1595,1515,979,1584,2448,1998,1525,602,2592)

data <- data.frame(y,x,z)

In the first model, I omit the offset, just for a baseline check. But this defeats the purpose of creating a rate interpretation:

m1 <- glm.nb(y ~ x, data=data)

The problem arises in the second model when I add the offset(log(z)) inside the formula.

m2 <- glm.nb(y ~ x + offset(log(z)),data=data)

In fumbling around, I also moved the offset outside of the formula, but this converts the offset to a weights function across the whole model, where the z is now weighting across both y and x:

m3 <- glm.nb(y ~ x, offset(log(z)),data=data)

When I run the predict() function on these 3 models

pred <- cbind(data, "m1" = predict(m1, type="link", se.fit=TRUE)[1:2],
                "m2" = predict(m2, type="link", se.fit=TRUE)[1:2],
                "m3" = predict(m3, type="link", se.fit=TRUE)[1:2])
head(pred,10)

    y      x    z   m1.fit m1.se.fit   m2.fit  m2.se.fit   m3.fit  m3.se.fit
1   0 -35.93  237 1.860281 0.1347283 1.076164 0.11099828 1.983966 0.05165139
2   0 -35.81  155 1.862478 0.1343475 0.652252 0.11068868 1.985993 0.05150751
3  20 -33.93 2523 1.896910 0.1284076 3.453355 0.10585949 2.017758 0.04926325
4  28 -33.56  960 1.903686 0.1272450 2.489313 0.10491404 2.024009 0.04882388
5  20 -32.87 2907 1.916324 0.1250827 3.601413 0.10315558 2.035667 0.04800671
6  24 -32.11 1972 1.930243 0.1227105 3.217918 0.10122610 2.048508 0.04711008
7   8 -31.47  531 1.941964 0.1207209 1.909731 0.09960761 2.059322 0.04635798
8   8 -31.29  353 1.945261 0.1201627 1.502521 0.09915349 2.062363 0.04614696
9   4 -31.27  238 1.945627 0.1201008 1.108444 0.09910306 2.062701 0.04612352
10  9 -30.76 1402 1.954968 0.1185229 2.884901 0.09781922 2.071318 0.04552694
...

m1.fit and m3.fit have nice orderly fitted values, which plot nicely, while m2.fit is kind of random. In calculating the coefficients from m2.fit by hand, I don't get the same values at the given value of x either. I've spent many hours going over this problem and the data, simplifying down to this example. I'm not sure what's going on here. I have about 25 variables that I want to fit and this problem arises in all of them.

Question:

Can someone help me understand why model 2 is producing unordered fitted values when the formula includes the offset y ~ x + offset(log(z))? Are these not the predicted logged values of the rate of $log(y/z) = x$? I'm open to a better way of doing this too if you have any suggestions.

Thanks in advance for the answer!

Gordon Smyth · Accepted Answer · 2017-06-26T06:48:47.403

Are these not the predicted logged values of the rate?

No, they're not. The predictions are for the expected frequency of events, not for the rate of events. The frequency of events obviously depends on $z$ as well as on $x$ and hence is not an ordered function of $x$ alone.

You don't explain what $z$ is in your data, but presumably it is the total population "at risk" size for each case.

To convert frequencies to rates you must divide by $z$. The only correct model is m2. To get the predicted rate, you need

PredictedRate <- exp(m2[,1]) / z

You will find that this is an increasing function of $x$:

plot(x, PredictedRate)

Note that model m1 is wrong because it doesn't incorporate the population at risk size ($z$) into your model. Model m3 is doubly wrong -- not only does it ignore $z$ in the fitted model but it specifies inappropriate weights. You should quickly abandon these models!

This makes complete sense, I will definitely abandon those other models. I can't thank you enough for the quick reply! You are correct, I am looking at renter evictions where z = total renters in a given tract and y = evictions for that tract. This study is examining the effects of neighborhood change on evictions where x is the median centered value for the given variable. Thank you! — Tim, Jun 26 '17 at 07:10

Predict function on negative binomial produces strange fitted values when adding an offset

1 Answers1

Linked