To manually verify the predictions derived from using polr() from package MASS, assume a situation with a categorical dependent variable $Y$ with ordered categories $1, \ldots, g, \ldots, k$ and predictors $X_{1}, \ldots, X_{j}, \ldots, X_{p}$. polr() assumes the proportional odds model
$$
\text{logit}(p(Y \leqslant g)) = \ln \frac{p(Y \leqslant g)}{p(Y > g)} = \beta_{0_g} - (\beta_{1} X_{1} + \dots + \beta_{p} X_{p})
$$
For possible choices implemented in other functions, see this answer. The logistic function is the inverse of the logit-function, so the predicted probabilities $\hat{p}(Y \leqslant g)$ are
$$
\hat{p}(Y \leqslant g) = \frac{e^{\hat{\beta}_{0_{g}} - (\hat{\beta}_{1} X_{1} + \dots + \hat{\beta}_{p} X_{p})}}{1 + e^{\hat{\beta}_{0_{g}} - (\hat{\beta}_{1} X_{1} + \dots + \hat{\beta}_{p} X_{p})}}
$$
The predicted category probabilities are $\hat{P}(Y=g) = \hat{P}(Y \leq g) - \hat{P}(Y \leq g-1)$. Here is a reproducible example in R with two predictors $X_{1}, X_{2}$. For an ordinal $Y$ variable, I cut a simulated continuous variable into 4 categories.
set.seed(1.234)
N <- 100 # number of observations
X1 <- rnorm(N, 5, 7) # predictor 1
X2 <- rnorm(N, 0, 8) # predictor 2
Ycont <- 0.5*X1 - 0.3*X2 + 10 + rnorm(N, 0, 6) # continuous dependent variable
Yord <- cut(Ycont, breaks=quantile(Ycont), include.lowest=TRUE,
labels=c("--", "-", "+", "++"), ordered=TRUE) # ordered factor
Now fit the proportional odds model using polr() and get the matrix of predicted category probabilities using predict(polr(), type="probs").
> library(MASS) # for polr()
> polrFit <- polr(Yord ~ X1 + X2) # ordinal regression fit
> Phat <- predict(polrFit, type="probs") # predicted category probabilities
> head(Phat, n=3)
-- - + ++
1 0.2088456 0.3134391 0.2976183 0.1800969
2 0.1967331 0.3068310 0.3050066 0.1914293
3 0.1938263 0.3051134 0.3067515 0.1943088
To manually verify these results, we need to extract the parameter estimates, from these calculate the predicted logits, from these logits calculate the predicted probabilities $\hat{p}(Y \leqslant g)$, and then bind the predicted category probabilities to a matrix.
ce <- polrFit$coefficients # coefficients b1, b2
ic <- polrFit$zeta # intercepts b0.1, b0.2, b0.3
logit1 <- ic[1] - (ce[1]*X1 + ce[2]*X2)
logit2 <- ic[2] - (ce[1]*X1 + ce[2]*X2)
logit3 <- ic[3] - (ce[1]*X1 + ce[2]*X2)
pLeq1 <- 1 / (1 + exp(-logit1)) # p(Y <= 1)
pLeq2 <- 1 / (1 + exp(-logit2)) # p(Y <= 2)
pLeq3 <- 1 / (1 + exp(-logit3)) # p(Y <= 3)
pMat <- cbind(p1=pLeq1, p2=pLeq2-pLeq1, p3=pLeq3-pLeq2, p4=1-pLeq3) # matrix p(Y = g)
Compare to the result from polr().
> all.equal(pMat, Phat, check.attributes=FALSE)
[1] TRUE
For the predicted categories, predict(polr(), type="class") just picks - for each observation - the category with the highest probability.
> categHat <- levels(Yord)[max.col(Phat)] # category with highest probability
> head(categHat)
[1] "-" "-" "+" "++" "+" "--"
Compare to result from polr().
> facHat <- predict(polrFit, type="class") # predicted categories
> head(facHat)
[1] - - + ++ + --
Levels: -- - + ++
> all.equal(factor(categHat), facHat, check.attributes=FALSE) # manual verification
[1] TRUE
predictfunction differ from the ones you generated manually? What is the structure of your dependent variable? Please provide a reproducible example. – Sven Hohenstein Oct 23 '12 at 08:07predictby default returns the response with the highest probability for each data point. Perhaps you want the actual probabilities for each possible response? If so, check the documentation for options for thepredictfunction, I know it exists, but don't remember the syntax. – Aaron left Stack Overflow Oct 23 '12 at 12:21Also, no, it's not that I just want R to do this, I'm trying to understand where the behavior is deviating from my expectations (because I suspect the error is on my part, not R).
– prototoast Oct 23 '12 at 15:52polr()against other functions? You could trylrm()from packagerms:lrmFit <- lrm(y ~ pop0 + inc0); predict(lrmFit, type="fitted.ind"). Another option isvglm()from packageVGAM:vglmFit <- vglm(y ~ pop0 + inc0, family=propodds); predict(vglmFit, type="response"). Both return the matrix of predicted category probabilities. See my answer to get the predicted categories from there. – caracal Oct 23 '12 at 16:19