
I'm using statsmodels to run an Ordinary Least Squares regression of a continuous dependent variable on two categorical predictors.

My data is structured like this:

Group    Model    Rate 
----------------------
Group A  Model 1  1.3
Group B  Model 7  0.43
Group B  Model 1  0.77
Group G  Model 2  3.2

I'm trying to relate the group and model variables to the rate outcome.

I've used statsmodels.formula.api.ols to create the model, but after fitting it, the result doesn't seem to contain all values of my categorical variables.

This is how I created and fit the model:

import statsmodels.formula.api

model = statsmodels.formula.api.ols('rates ~ C(models) + C(groups)', data=df)
fitted_model = model.fit()
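A minimal runnable version with made-up data shaped like the table above (column names assumed to match the formula string) shows the same behaviour:

```python
# Minimal reproducible sketch with made-up data shaped like the table above.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "groups": ["Group A", "Group B", "Group B", "Group G", "Group A", "Group G"],
    "models": ["Model 1", "Model 7", "Model 1", "Model 2", "Model 2", "Model 7"],
    "rates":  [1.3, 0.43, 0.77, 3.2, 2.1, 0.5],
})

fitted_model = smf.ols("rates ~ C(models) + C(groups)", data=df).fit()

# The params index has an Intercept plus k-1 dummies per factor, e.g.
# C(models)[T.Model 2], C(models)[T.Model 7], C(groups)[T.Group B],
# C(groups)[T.Group G] -- "Model 1" and "Group A" never appear.
print(fitted_model.params)
```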

The result looks good and makes sense, except for the missing levels. I inspected the result by looking at fitted_model.params: it lists all levels of my "Group" variable but one, and the same for "Model". It also gives an "Intercept".

I'm guessing my issue is statistical, rather than coding. Is there a reason one level of each categorical variable would be elided? If I'm interested in the effect of those missing levels on my outcome (coefficients, p-values), how can I find that out?

Nick S
    It's coding: search our site for "dummy" (an unfortunate name for these variables!) or "categorical" with "coding" for explanations. – whuber Sep 21 '20 at 22:04
  • @whuber Right, right, I do know about encoding categorical variables, but it's been a while since I read about it. So that's why one value is being omitted, but then how could I calculate some number for its correlation or p-value? – Nick S Sep 21 '20 at 22:16
  • That's a matter of telling your software either to use a different coding or following up with explicit testing, which again is a software issue. – whuber Sep 22 '20 at 13:04
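Following up on the comment above, here is a hedged sketch of both routes (a different coding, or explicit testing) using standard patsy/statsmodels features; the data and column names are made up to match the formula in the question:

```python
# Two routes to get at the reference level's effect (made-up data).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "groups": ["Group A", "Group B", "Group B", "Group G", "Group A", "Group G"],
    "models": ["Model 1", "Model 7", "Model 1", "Model 2", "Model 2", "Model 7"],
    "rates":  [1.3, 0.43, 0.77, 3.2, 2.1, 0.5],
})

# Route 1: change the coding -- pick a different reference level, so the
# previously omitted "Model 1" gets its own coefficient (its effect
# relative to the new reference, "Model 7").
m1 = smf.ols(
    "rates ~ C(models, Treatment(reference='Model 7')) + C(groups)", data=df
).fit()
print(m1.params)  # now includes a [T.Model 1] dummy

# Route 2: explicit testing -- keep the default coding and run a Wald
# test on a contrast, e.g. "do Model 2 and Model 7 differ?", which the
# default coefficient table does not report directly.
m2 = smf.ols("rates ~ C(models) + C(groups)", data=df).fit()
print(m2.t_test("C(models)[T.Model 2] - C(models)[T.Model 7] = 0"))
```

Note that with treatment coding the intercept is the predicted mean of the reference cell, so the "missing" level is not gone; its effect is folded into the intercept, and the other coefficients are differences from it.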

0 Answers