How to get probabilities with regression

Question

I have data like this:

group   length
1       5
1       5
1       2
1       3
1       5
1       5
1       3
1       2
1       5
1       3
2       3
2       3
2       3
2       3
2       5
2       2
2       5
2       3
2       3
2       3

I would like to get the probability of length taking each of its observed values (2, 3, 5), separately for each group. I would like to get this with regression. Transformations of the data are fine if required. I am using Stata right now, but any explanation/pseudo-code is greatly appreciated.

To illustrate what I mean, here is how I would do this manually:

*1. Transform the data to get counts by length for each group, calculate total, and calculate probabilities:

group   length_2_N  length_3_N  length_5_N Total prob_2  prob_3  prob_5
1       2           3           5          10    .2      .3      .5  
2       1           7           2          10    .1      .7      .2
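The manual calculation in the table above can be sketched in a few lines of Python. This is only an illustration of the counting step, not a regression; the helper name `empirical_probs` is made up for this example.

```python
from collections import Counter

# (group, length) pairs copied from the question
data = [(1, 5), (1, 5), (1, 2), (1, 3), (1, 5), (1, 5), (1, 3), (1, 2), (1, 5), (1, 3),
        (2, 3), (2, 3), (2, 3), (2, 3), (2, 5), (2, 2), (2, 5), (2, 3), (2, 3), (2, 3)]

def empirical_probs(rows):
    """Return {group: {length: share of that group's rows with this length}}."""
    counts = Counter(rows)                  # occurrences of each (group, length) pair
    totals = Counter(g for g, _ in rows)    # number of rows per group
    lengths = sorted({x for _, x in rows})  # observed length values
    return {g: {x: counts[(g, x)] / totals[g] for x in lengths}
            for g in sorted(totals)}

probs = empirical_probs(data)
for g, p in probs.items():
    print(g, p)
```

Any regression that is saturated in group should reproduce these shares as its fitted probabilities.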

What I want is to get the .2, .3, .5 and .1, .7, .2 from a regression. It is fine if I need to split the data by group and run two regressions. Any hints?

I think what I basically want is to estimate P(length = x) = $\alpha$ for each x in {2, 3, 5} (for each group). Additionally, it would be useful to estimate P(length = x) = $\alpha$ + $\beta$ group.

Could you explain why you must use some kind of regression procedure to perform a calculation that you can already do by other reasonably simple means, as you have shown? (There can be several legitimate reasons for this, but I don't want to mislead you by suggesting what I think they could be.) — whuber, Jun 02 '15 at 20:27
Sure, @whuber. Suppose that group refers to an interval of time, for example a month. Suppose that I actually have many groups. In my regression, I would also like to fit a linear time trend so that I can extrapolate for future months. — bill999, Jun 02 '15 at 21:18
It is now quite unclear what you are trying to do. Could you perhaps edit your post to show us what the answer might look like with such a regression? — whuber, Jun 02 '15 at 21:24
Sorry for being unclear. The main thing I want to do is described in the bulk of the question - simply estimating probabilities and not considering the time trend thing at all.
The last sentence, where I said it would also be useful to estimate P(length = x) = $\alpha$ + $\beta$ group is referring to adding the time trend into the regression. — bill999, Jun 02 '15 at 21:27
I think I follow. You may be going a little astray at the end by supposing the probability should be a linear function of group, especially if group later will represent a time: such models tend to forecast (or even fit) mathematically invalid probabilities eventually. The standard solution is multinomial regression. Perhaps that's what you're looking for? (I describe this model and interpret its coefficients in the second half of an answer at http://stats.stackexchange.com/a/17203.) — whuber, Jun 02 '15 at 21:32
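To make the point about multinomial regression concrete, here is a rough Python sketch. It assumes group enters as a numeric trend and that length = 2 is the baseline outcome. With only two groups the multinomial logit fits the observed proportions exactly, so the coefficients can be written in closed form; with more groups an iterative maximum-likelihood fit would be needed instead. The function names `predict` and `linear_extrapolation` are invented for this illustration.

```python
import math

# Empirical proportions from the question, P(length = x | group)
p = {1: {2: 0.2, 3: 0.3, 5: 0.5},
     2: {2: 0.1, 3: 0.7, 5: 0.2}}
BASE = 2  # baseline outcome; its linear predictor is fixed at 0

# Model: log(P(x) / P(BASE)) = a_x + b_x * group.
# Two groups give two points per outcome, so the line through them is exact.
coef = {}
for x in (3, 5):
    lo1 = math.log(p[1][x] / p[1][BASE])  # observed log-odds in group 1
    lo2 = math.log(p[2][x] / p[2][BASE])  # observed log-odds in group 2
    b = lo2 - lo1                         # change in log-odds per unit of group
    a = lo1 - b * 1                       # intercept (log-odds at group = 0)
    coef[x] = (a, b)

def predict(group):
    """Multinomial-logit probabilities for a given (possibly future) group."""
    eta = {BASE: 0.0}
    eta.update({x: a + b * group for x, (a, b) in coef.items()})
    denom = sum(math.exp(v) for v in eta.values())
    return {x: math.exp(v) / denom for x, v in eta.items()}

def linear_extrapolation(group):
    """Straight line through the two observed probabilities, for contrast."""
    return {x: p[1][x] + (p[2][x] - p[1][x]) * (group - 1) for x in p[1]}

for g in (1, 2, 3):
    print(g, predict(g), linear_extrapolation(g))
```

For groups 1 and 2 `predict` returns the observed proportions. For a hypothetical group 3, the straight-line extrapolation leaves the [0, 1] range for length 3 and length 5, while the multinomial-logit probabilities stay valid and sum to one.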

Jake · Answer 1 · 2015-06-02T21:34:58.213

This might not be what you're looking for, but have you considered logistic regression?

Check out this page for some quick info on it. I know it's pretty easy to do in R, so I'm sure it would be easy to do in Stata too (I don't use Stata, though).

http://en.wikipedia.org/wiki/Logistic_regression

The key requirement of (binary) logistic regression is that your response variable has only two levels; explanatory variables such as your group can be categorical or numerical, with any number of levels. Your length takes three values (2, 3, 5), so you would either model an indicator such as "length = 5" versus everything else, one value at a time, or move to multinomial logistic regression. Using this type of regression, you can calculate probabilities and log odds with as much covariate (categorical, numerical) information as you want, so long as the response fits the assumption above.
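As a hedged sketch of how logistic regression can return these probabilities, the Python below fits one binary model per length value (an indicator for, say, length = 5 versus not 5) with group as a 0/1 dummy. The Newton-Raphson routine `fit_logit` is hand-rolled for illustration only and is not a library API; in practice a packaged logistic-regression routine would be used.

```python
import math

# (group, length) pairs copied from the question
data = [(1, 5), (1, 5), (1, 2), (1, 3), (1, 5), (1, 5), (1, 3), (1, 2), (1, 5), (1, 3),
        (2, 3), (2, 3), (2, 3), (2, 3), (2, 5), (2, 2), (2, 5), (2, 3), (2, 3), (2, 3)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(x, y, iters=25):
    """Newton-Raphson for the model P(y = 1) = sigmoid(b0 + b1 * x)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = sigmoid(b0 + b1 * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu           # score for the intercept
            g1 += (yi - mu) * xi    # score for the slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score vector
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

is_group2 = [1 if g == 2 else 0 for g, _ in data]
fitted = {}
for v in (2, 3, 5):
    y = [1 if length == v else 0 for _, length in data]  # one-vs-rest indicator
    b0, b1 = fit_logit(is_group2, y)
    fitted[v] = (sigmoid(b0), sigmoid(b0 + b1))          # P(length = v) in groups 1 and 2
    print(v, fitted[v])
```

Because group is the only predictor and enters as a dummy, each model is saturated and its fitted probabilities coincide with the observed shares in each group. With extra covariates the three separate fits would no longer be guaranteed to sum to one, which is one reason to prefer a multinomial model.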

Hope this helps!

You're on the right track. But please see http://stats.stackexchange.com/questions/60087 and http://stats.stackexchange.com/questions/52104/. — whuber, Jun 02 '15 at 21:33

dimitriy · Answer 2 · 2015-06-02T22:42:38.890

Here's how I might do this in Stata, treating group as a categorical variable and using a Poisson model with robust (sandwich) standard errors:

. clear

. input group   length

         group     length
  1. 1       5
  2. 1       5
  3. 1       2
  4. 1       3
  5. 1       5
  6. 1       5
  7. 1       3
  8. 1       2
  9. 1       5
 10. 1       3
 11. 2       3
 12. 2       3
 13. 2       3
 14. 2       3
 15. 2       5
 16. 2       2
 17. 2       5
 18. 2       3
 19. 2       3
 20. 2       3
 21. end

. poisson length i.group, robust

Iteration 0:   log pseudolikelihood = -34.379996  
Iteration 1:   log pseudolikelihood = -34.379996  

Poisson regression                              Number of obs     =         20
                                                Wald chi2(1)      =       1.04
                                                Prob > chi2       =     0.3086
Log pseudolikelihood = -34.379996               Pseudo R2         =     0.0051

------------------------------------------------------------------------------
             |               Robust
      length |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     2.group |  -.1410786   .1385692    -1.02   0.309    -.4126692     .130512
       _cons |   1.335001   .1066392    12.52   0.000     1.125992     1.54401
------------------------------------------------------------------------------

. margins group, predict(p(2)) predict(p(3)) predict(p(5))

Adjusted predictions                            Number of obs     =         20
Model VCE    : Robust

1._predict   : Pr(length=2), predict(p(2))
2._predict   : Pr(length=3), predict(p(3))
3._predict   : Pr(length=5), predict(p(5))

--------------------------------------------------------------------------------
               |            Delta-method
               |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
_predict#group |
          1 1  |    .161517   .0310033     5.21   0.000     .1007517    .2222823
          1 2  |   .2008288   .0231013     8.69   0.000     .1555512    .2461065
          2 1  |   .2045882   .0174537    11.72   0.000     .1703796    .2387968
          2 2  |   .2209117   .0058642    37.67   0.000     .2094182    .2324053
          3 1  |   .1477127   .0189024     7.81   0.000     .1106647    .1847606
          3 2  |   .1202864   .0180939     6.65   0.000      .084823    .1557498
--------------------------------------------------------------------------------

For example, $\Pr(\text{length}=5 \mid \text{group}=2) = .1202864$
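To show what the margins above are computing, here is a small Python sketch. It assumes only the two coefficients printed in the Stata output and evaluates the Poisson probability mass function at each group's fitted mean; the helper name `poisson_pmf` is introduced just for this example.

```python
import math

def poisson_pmf(k, mean):
    """P(Y = k) for a Poisson variable with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

# Coefficients reported in the Stata output above
cons, b_group2 = 1.335001, -0.1410786

# The fitted mean per group is exp(linear predictor): about 3.8 and 3.3,
# which are the sample means of length in groups 1 and 2
means = {1: math.exp(cons), 2: math.exp(cons + b_group2)}

for g, m in means.items():
    print(g, round(m, 3), {k: round(poisson_pmf(k, m), 4) for k in (2, 3, 5)})
```

These values match the margins table, but they differ from the raw proportions in the question (for instance .12 rather than .2 for length 5 in group 2). The Poisson model reproduces only each group's mean and spreads probability over every non-negative integer, including values such as 0, 1 and 4 that never occur in the data.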

Could you please explain what the underlying model is and what this code is doing? Otherwise it is meaningless to anyone who is not using your (unnamed) software (which I guess is Stata). — whuber, Jun 02 '15 at 22:33
