I'm using the following code in R to predict votes (i.e. non-negative integer count data).
m1 <- glm(votes ~ ., data = trainset, family = poisson(link = "sqrt"))
pred1 <- predict(m1, newdata = testset, type = "response")
> pred1[1:10]
3 4 5 6 7 8 11 12 14 15
0.8000618 1.4012718 0.9924539 0.9260005 0.2820739 0.3333504 0.3238205 0.5786863 1.1216740 1.0114024
> str(pred1)
Named num [1:14999] 0.8 1.401 0.992 0.926 0.282 ...
- attr(*, "names")= chr [1:14999] "3" "4" "5" "6" ...
Q1: I don't understand the structure of the predicted output. Isn't pred1 simply a numeric vector? Why does each element in pred1 have a name? For example, the first element, 0.8000618, is named "3", the second is named "4", and so on. What is the purpose of these names?
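To make Q1 concrete, here is a minimal sketch on simulated data. The data frame `d`, the split into `train` and `test`, and every other name are hypothetical stand-ins, not my real trainset or testset. As far as I can tell, the names on the prediction vector are simply the row names of the data frame passed as newdata, and unname() removes them:

```r
# Hypothetical toy data standing in for the real trainset/testset.
set.seed(1)
d <- data.frame(x = runif(100))
d$votes <- rpois(100, lambda = (1 + d$x)^2)

# Hold out a few rows; subsetting keeps the original row names ("3", "4", ...).
test_idx <- c(3:8, 11, 12, 14, 15)
test  <- d[test_idx, ]
train <- d[-test_idx, ]

m <- glm(votes ~ x, data = train, family = poisson(link = "sqrt"))
p <- predict(m, newdata = test, type = "response")

# Compare the names of the predictions with the row names of the test set.
same_names <- identical(names(p), rownames(test))
print(same_names)

# Drop the names when a bare numeric vector is wanted.
p_plain <- unname(p)
str(p_plain)
```

If that is right, the names only record which row of the test set each prediction belongs to, and the values still behave like an ordinary numeric vector.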
Q2: I get better results using link="sqrt" than with link="log" when fitting model m1. I think what this link setting does is model the response as either sqrt(prediction) or log(prediction), as opposed to simply prediction, so the better fit with the square root would mean the data are better described by such an equation. Is there anything deeper than that going on? Also, although R says link="identity" should be possible, it always fails with "Error: no valid set of coefficients has been found: please supply starting values". Is the problem with me or with R?
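A small sketch of what the link setting does, again on simulated data with hypothetical names (`d`, `m_sqrt`, `m_log`, `m_id`). In a GLM the link is applied to the mean of the response, not to the response itself, so with link="sqrt" the linear predictor equals sqrt(mean) and the fitted mean is the linear predictor squared. The identity-link fit at the end assumes the error comes from the default starting point producing non-positive means, which is only a guess about my real data; there, passing explicit start values may help:

```r
# Hypothetical simulated counts whose mean is exactly linear on the sqrt scale.
set.seed(2)
d <- data.frame(x = runif(200))
d$votes <- rpois(200, lambda = (1 + d$x)^2)

m_sqrt <- glm(votes ~ x, data = d, family = poisson(link = "sqrt"))
m_log  <- glm(votes ~ x, data = d, family = poisson(link = "log"))

# The link connects the linear predictor eta to the mean mu: sqrt(mu) = eta.
eta <- predict(m_sqrt, type = "link")
mu  <- predict(m_sqrt, type = "response")
print(all.equal(mu, eta^2))

# Same family and data, so the two links can be compared by AIC.
print(AIC(m_sqrt, m_log))

# Identity link: mu = eta must stay positive. Starting from a constant
# positive mean (intercept = mean count, slope = 0) gives glm a valid
# set of coefficients to begin iterating from.
m_id <- glm(votes ~ x, data = d, family = poisson(link = "identity"),
            start = c(mean(d$votes), 0))
print(coef(m_id))
```

Whether the identity link is sensible at all depends on the fitted means staying positive over the whole range of the predictors, which the log link guarantees and the identity link does not.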
link="log" is the canonical link for the Poisson model.
– Alex Nov 16 '15 at 22:47