
The response variable in the dataset is highly skewed, with a "ceiling effect". The errors of a fitted regression model will thus also be skewed. I tried fitting a regression, but as expected the errors were not normally distributed. I doubt that a transformation will help much.

The dependent variable (DV) ranges from 1 (not probable) to 7 (highly probable). There are 1197 observations and about 40 predictors, most of them categorical, ordinal or interval scaled. The predictor of most interest is categorical with 8 levels.

I have looked at different options, but am unsure which is the smartest one.

Option A: Tobit models. I have just started familiarizing myself with Tobit regression. From what I understand, it estimates two models: one estimates whether or not participants choose 7, and the second fits the coefficients for the dependent variable. Initial results are OK and I can interpret them, but comparing the diagnostic plot with the one in the UCLA tutorial, I am not sure whether the model is robust.

Option B: Ordinal regression. I can treat the dependent variable as ordinal or interval scaled. Treating it as ordinal, I can fit an ordinal regression. But how robust is ordinal regression to a skewed distribution? Questions on StackExchange about this issue have no real answer or none at all (question 1, question 2, question 3).

Option C: Two-part models. These are mentioned in this paper. Is there another two-part model, in addition to Tobit regression, that would be suitable for this type of data?


These are the three options I have come across so far. Are there any other alternatives?

Thanks!

Simone
  • On this information, ordinal regression is easily the best choice. – Nick Cox Sep 25 '23 at 16:32
  • 40 predictors is way too many; some levels of your DV have fewer than 100 cases. So, whichever approach you choose (and I agree with @NickCox), you will have to be careful building your model. LASSO might be good. – Peter Flom Sep 25 '23 at 16:46
  • Tobit models are meant for censored dependent variables, but in your case there does not appear to be any issue with censoring. – Durden Sep 25 '23 at 20:36
  • @PeterFlom very true, that is why I said what the main focus is (categorical with 8 levels). As a next step we then want to find boundary conditions, hence we collected additional information (variables). – Simone Sep 26 '23 at 06:59
  • @Durden yes, Tobit has a conceptual aspect, for which I can argue why this might be the case. But there is also a technical difference: 2 models are estimated. So very true - it is not just about fitting a model that provides some results. Hence why I posted the question – Simone Sep 26 '23 at 07:04
  • See https://stats.stackexchange.com/questions/146533/versus-vs-how-to-properly-use-this-word-in-data-analysis on the use of versus. In my reading, the wording residual vs fitted is much more common than the opposite, so that what is on the y or vertical axis is mentioned first. – Nick Cox Sep 26 '23 at 08:36
  • @Simone You're right, Tobit is an instance of a two-part (likelihood) model: a Gaussian part that models the non-censored dependent variable, and a Probit part that models the censoring itself. This specific setup doesn't seem appropriate in your case, but the general principle of combining two (or more) probability models extends beyond Tobit (McElreath has a nice lecture on this). In your case, if you wanted to go that route, you could combine an ordinal model with a binary model that accounts for the relatively many 1s and 7s in your data. – Durden Sep 26 '23 at 15:28

1 Answer


The usual model for ordinal logistic regression makes no assumption about the marginal distribution of the DV, so a skewed DV is not a problem in itself. It does assume proportional odds. If this assumption is violated, there are various remedies. The most straightforward is multinomial logistic regression, but this generates a lot of parameters.
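To make the point about skewness concrete, here is a minimal sketch of how a proportional-odds (cumulative logit) model assigns probabilities to the 7 response levels. It is not fitted to your data: the threshold values are hypothetical, and the sign convention logit P(Y <= k) = theta_k - eta is an assumption (software differs on the sign). The thresholds alone can place most of the mass on 7, and a shift in the linear predictor moves every cumulative log-odds by the same amount, which is what "proportional odds" means.

```python
import math


def logistic(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probs(thresholds, eta):
    """P(Y = k), k = 1..K, under logit P(Y <= k) = theta_k - eta.

    thresholds: increasing cut points theta_1 < ... < theta_{K-1}
    eta: linear predictor x'beta for one observation
    """
    cumulative = [logistic(t - eta) for t in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)  # difference of adjacent cumulative probs
        previous = c
    return probs


def cumulative_log_odds(thresholds, eta):
    """log-odds of Y <= k for each of the K-1 cut points."""
    return [t - eta for t in thresholds]


# Hypothetical thresholds, chosen only to mimic a ceiling effect at 7.
thresholds = [-4.0, -3.2, -2.5, -1.8, -1.0, -0.2]

p_base = category_probs(thresholds, eta=0.0)
p_shift = category_probs(thresholds, eta=1.0)

# Change in cumulative log-odds when eta rises by 1: identical at every cut.
shifts = [
    a - b
    for a, b in zip(
        cumulative_log_odds(thresholds, 1.0),
        cumulative_log_odds(thresholds, 0.0),
    )
]

print([round(p, 3) for p in p_base])
print([round(p, 3) for p in p_shift])
print(shifts)
```

With these made-up thresholds, more than half of the probability sits on 7 at eta = 0, and the model has no difficulty with that. Skewness is absorbed by the cut points, not by a distributional assumption on the DV.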

Another possibility is partial proportional odds, which relaxes the assumption only for selected predictors. This is available in R (your graphs look like R graphs); see here.
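A rough parameter count shows why the choice matters. This is only a sketch: the 7 levels of the DV and the 8-level predictor (7 dummy columns) come from the question, but the 10 "other" design columns are a hypothetical number, and the helper names are made up for illustration.

```python
def n_params_cumulative(n_levels, n_prop_cols, n_nonprop_cols=0):
    """Parameters in a cumulative-logit model.

    K - 1 thresholds, one slope per column under proportional odds,
    and K - 1 slopes per column for which proportional odds is relaxed.
    """
    cuts = n_levels - 1
    return cuts + n_prop_cols + cuts * n_nonprop_cols


def n_params_multinomial(n_levels, n_cols):
    """Parameters in a multinomial logit: an intercept plus one slope per
    column, for each of the K - 1 non-reference outcome levels."""
    return (n_levels - 1) * (1 + n_cols)


K = 7            # DV levels 1..7 (from the question)
key_cols = 7     # 8-level categorical predictor -> 7 dummy columns
other_cols = 10  # hypothetical number of additional design columns

po = n_params_cumulative(K, key_cols + other_cols)   # proportional odds
ppo = n_params_cumulative(K, other_cols, key_cols)   # relax the key predictor only
mnl = n_params_multinomial(K, key_cols + other_cols)

print(po, ppo, mnl)  # 23 58 108
```

Under these assumptions the partial proportional odds model sits between the two extremes, which is the appeal when only a few predictors violate the assumption.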

Another worry, as I mentioned in a comment, is the large number of predictor variables. You will want to be careful in how you build your model.

Peter Flom