
By default, R encodes ordinal variables (factor(..., ordered=TRUE)) with "polynomial contrasts". I have read in several places that these are only useful for evenly spaced ordered levels, which contradicts the very idea of an ordinal variable (ordered, but with no specified spacing).

Moreover, evenly spaced levels can be encoded much more simply as consecutive integers. Is there any use case where an encoding as "polynomial contrasts" makes sense, and what is the advantage over an integer representation, e.g., in regression?

cdalitz

1 Answer


Ultimately, all (standard) contrast codings use all available degrees of freedom, meaning that polynomial contrasts, like all other standard choices of contrasts, allow fully flexible modelling of the effect of every single category. Mathematically, at least as far as the fit is concerned, all such contrast choices are equivalent; the difference lies in the interpretation of the estimated coefficients and in the standard hypotheses tested.

If you just represent your ordered categories by integers, a linear model will assume that they affect $y$ linearly, whereas polynomial contrasts allow for all kinds of nonlinear impact (if I'm not mistaken, you can emulate the mathematically equivalent model by using the integer representation and adding squared, cubic, etc. terms). What polynomial contrasts allow you to do is to test a model in which the impact is just linear against a more complex relationship involving higher-order polynomial terms, and then to go further and test a quadratic model against one with cubic and possibly even higher-order terms.
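To make this concrete, here is a minimal sketch in R (with made-up data; the variable names are illustrative) comparing the integer representation with polynomial contrasts and plain dummy coding:

```r
# Made-up data: a 4-level ordinal predictor with a nonlinear effect on y.
set.seed(1)
x <- factor(sample(1:4, 200, replace = TRUE))
y <- c(1, 3, 4, 4.2)[x] + rnorm(200)

fit_int  <- lm(y ~ as.numeric(x))   # integer coding: linear effect only
fit_poly <- lm(y ~ ordered(x))      # polynomial contrasts (.L, .Q, .C)
fit_cat  <- lm(y ~ x)               # dummy coding: fully flexible

# The two full-degrees-of-freedom codings fit identically;
# the integer coding can only do worse:
sapply(list(int = fit_int, poly = fit_poly, cat = fit_cat),
       function(m) summary(m)$r.squared)

# Test "linear is enough" against the fully flexible model:
anova(fit_int, fit_poly)
```

The anova call at the end is exactly the "linear vs. more complex" test described above: the linear model is nested in the fully flexible one.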

I wouldn't say that "these are only useful for evenly spaced ordered levels", though this is a subtle issue. On the one hand, it is true that the levels are treated as equally spaced in the basis of the polynomials. On the other hand, as explained above, this does not restrict the modelling flexibility. The question of interest addressed by polynomial contrasts is what functional form can describe the connection between the ordinal levels and the outcome. One could argue that if a linear term turns out to be enough, the regression behaves as if an even spacing of the ordinal categories makes sense. In this vein, one use case is when it is of interest whether the categories act in an evenly spaced, linear way, rather than having to assume in advance that they do.

For example, one could be interested in the relation between a patient's assessment of their well-being on a scale from 1 to 10 and some measured indicators. One may suspect that the patient's scoring is meant to be in some sense evenly spaced (they are asked for integer numbers, after all), but of course there is no calibration, and the score is not interval-scaled in any well-defined sense. Using polynomial coding, it may or may not turn out that the relationship between these scores and the measurements is roughly linear, and if it is, the evenly spaced integer numbers apparently express well how the patients have used the scale (at least as long as the relationship isn't very noisy, i.e., as long as $R^2$ isn't low).
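Continuing the sketch above, the coefficient names produced by the polynomial coding make this inspection easy: if only the linear contrast .L is clearly nonzero, the levels behave as if they were evenly spaced.

```r
# Coefficients are labelled by polynomial degree:
# ordered(x).L (linear), .Q (quadratic), .C (cubic).
summary(fit_poly)$coefficients
```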

If, however, a linear term is not enough, in most cases I'd take the polynomial contrasts just as one representation of a flexible categorical model, rather than interpreting them as a squared, cubic, etc. relationship between evenly spaced categories. The best one can do beyond interpreting a "linear only" relationship is probably to make some qualitative statements about the fitted function (monotonicity, number of local optima, where the minimum is, etc.). Such interpretations do not require the categories to be evenly spaced, although I don't want to rule out that in some situations one can come up with a meaningful interpretation of, say, a squared relationship.

I should also say that if only qualitative interpretations are given, I don't think the polynomial coding is more useful than other standard codings, e.g., those that allow category effects to be interpreted relative to the grand mean or to the lowest category, unless the number of categories is high (e.g., a 1-100 rating scale), in which case it may be informative that a low-degree polynomial can fit the data well even if it is nonlinear.
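As a sketch of this equivalence, switching to another standard coding, e.g. sum contrasts (effects relative to the grand mean) or treatment contrasts (effects relative to the lowest category), changes the coefficients but not the fit:

```r
# Same data as above; only the contrast coding changes.
xo <- ordered(x)
contrasts(xo) <- contr.sum(4)        # effects relative to the grand mean
fit_sum <- lm(y ~ xo)

contrasts(xo) <- contr.treatment(4)  # effects relative to the lowest category
fit_trt <- lm(y ~ xo)

# Identical fitted values, different coefficient interpretations:
all.equal(fitted(fit_poly), fitted(fit_sum))  # TRUE
all.equal(fitted(fit_poly), fitted(fit_trt))  # TRUE
```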

  • Thanks for the answer. I just tested this with a linear regression mpg ~ cyl on the R builtin dataset mtcars and verified that applying "polynomial contrasts" to cyl is indeed identical to treating it as a category. Treating cyl as numeric instead leads to a (slightly) smaller $R^2$ and thus a poorer fit, because a linear link is assumed (see the sketch after these comments). – cdalitz Aug 24 '21 at 10:48
  • In the frequentist world we don't have many methods tailored to ordinal predictors (unlike the case for an ordinal Y). There are elegant Bayesian solutions. The R brms package's brm function automatically handles an ordered factor predictor properly by putting priors on the spacings of the coefficients of the $k-1$ indicator variables for $k$ levels. – Frank Harrell Aug 24 '21 at 12:16
  • @frank-harrell Thanks for the pointer to brms. Is the R package ordPens not also specifically written to address ordinal predictors? – cdalitz Aug 25 '21 at 12:22
  • It does. Thanks for letting me know about that package. The question I have is whether it handles mixtures of ordinal and non-ordinal variables. Also, how can you do regression in general with it, e.g., an ordinal Y model or survival analysis? – Frank Harrell Aug 25 '21 at 14:33
  • @frank-harrell ordSmooth allows for a mixture of ordinal, categorical (parameter u), and metric (parameter z) predictors. Moreover, it also supports logistic regression, so in principle it should also work for ordinal responses, but I do not see that this is already built in. – cdalitz Aug 25 '21 at 17:33
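A minimal sketch reproducing the check from the first comment, using only base R and the builtin mtcars data:

```r
data(mtcars)
m_num  <- lm(mpg ~ cyl, data = mtcars)           # numeric: linear link assumed
m_poly <- lm(mpg ~ ordered(cyl), data = mtcars)  # polynomial contrasts
m_cat  <- lm(mpg ~ factor(cyl), data = mtcars)   # plain category

# m_poly and m_cat give identical R^2; m_num's is slightly smaller:
sapply(list(num = m_num, poly = m_poly, cat = m_cat),
       function(m) summary(m)$r.squared)
```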