6

I am new to stats and I want to use a regression to determine income

See the example table below:

age class location income
23 Adult London 23000
44 Adult Glasgow 45000
75 Senior Birmingham 37000
12 Child Coventry 300

I know that the different stats packages (Python, R) create something like this when they have to deal with categorical data:

age Adult Senior Child
23 1 0 0
44 1 0 0
75 0 1 0
12 0 0 1

Why can't we use something like this instead when we're doing a regression ?

age class location
23 1 1
44 1 2
75 2 3
12 3 4

The reason I am asking is because location has a lot of different values and those are converted to like 30 something columns

  • Your problem seems to be that you have a factor with many possible levels. See the tag [tag:many-categories] and, for instance, https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen Jul 09 '22 at 16:14

3 Answers

8

With a categorical variable you often want to estimate a different value for each category.

For example:

$$\text{estimate} = \beta_{adult} \cdot x_{adult} + \beta_{senior} \cdot x_{senior} + \beta_{child} \cdot x_{child} $$

which you can also read as

$$\text{estimate} = \begin{cases} \beta_{adult} & \quad \text{if } x_{adult} = 1, x_{senior}=0, x_{child} =0\\ \beta_{senior} & \quad \text{if } x_{adult} = 0, x_{senior}=1, x_{child} =0\\ \beta_{child} & \quad\text{if } x_{adult} = 0, x_{senior}=0, x_{child} =1\\ \end{cases}$$


Your suggestion would treat the categorical variable as a numerical variable. That allows only a single parameter to be estimated, and it makes the estimates for the different groups dependent on each other:

$$\text{estimate} = \beta_{class} \cdot x_{class} $$

which is like

$$\text{estimate} = \begin{cases} 1\cdot \beta_{class} & \quad \text{if } x_{class} = 1 \\ 2\cdot \beta_{class} & \quad \text{if } x_{class} = 2 \\ 3\cdot \beta_{class} & \quad \text{if } x_{class} = 3 \\ \end{cases}$$
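The contrast between the two formulas can be sketched numerically. The incomes below are made-up numbers, and both fits omit an intercept to mirror the formulas above; this is an illustration under those assumptions, not a full regression.

```python
# Hypothetical incomes (made-up numbers), grouped by class.
incomes = {
    "adult": [23000, 45000, 31000],
    "senior": [37000, 31000],
    "child": [300, 500],
}

# One parameter per category: with only indicator columns and no intercept,
# the least-squares estimate of each beta is that category's mean income.
beta_per_class = {name: sum(v) / len(v) for name, v in incomes.items()}

# Single numeric column (adult=1, senior=2, child=3), no intercept:
# least squares gives beta = sum(x * y) / sum(x * x).
codes = {"adult": 1, "senior": 2, "child": 3}
sxy = sum(codes[name] * y for name, v in incomes.items() for y in v)
sxx = sum(codes[name] ** 2 for name, v in incomes.items() for _ in v)
beta_class = sxy / sxx

# The three group estimates are forced to be 1x, 2x and 3x the same number.
numeric_estimates = {name: codes[name] * beta_class for name in incomes}

print(beta_per_class)
print(numeric_estimates)
```

With the integer coding, the estimate for children can never be lower than the estimate for adults (unless the slope is negative, in which case seniors must fall in between), no matter what the data say.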


The reason I am asking is because location has a lot of different values and those are converted to like 30 something columns

The reason for the dummy variables and the many additional columns is that, in the underlying algebra that estimates the 30 parameters, you need a column for each parameter you estimate. The linear formula multiplies each parameter by a column. If you estimate more parameters (as you do for a categorical variable), then each category needs its own column.
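A minimal sketch of that "one column per parameter" idea, using the four locations from the question. The `betas` values are hypothetical placeholders (not fitted estimates), chosen only to show how each row picks out its own parameter.

```python
locations = ["London", "Glasgow", "Birmingham", "Coventry"]

# One column per distinct level; sorted only to get a reproducible column order.
levels = sorted(set(locations))

# Each row has a 1 in the column of its own level and 0 elsewhere.
design = [[1 if loc == level else 0 for level in levels] for loc in locations]

# Hypothetical parameter values, one per column (same order as `levels`).
betas = [37000, 300, 45000, 23000]

# The linear formula: multiply each parameter by its column and sum per row.
predictions = [sum(b * x for b, x in zip(betas, row)) for row in design]

for loc, row, pred in zip(locations, design, predictions):
    print(loc, row, pred)
```

With 30 locations, `levels` would have 30 entries and `design` would have 30 columns, which is exactly what the stats packages produce.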

8

Let's focus just on location. You are suggesting this model for income as a function of location:

$$I_i = a + b L_i + \varepsilon_i$$

Then, the expected increase in income between Glasgow (location 2) and London (location 1) is:

$$(a + b \times 2) - (a + b \times 1) = b$$

The expected increase in income between Birmingham and Glasgow is also $b$, as is the one between Coventry and Birmingham. You only have one parameter for this so you've imposed some amount of structure on this effect.

However, the location index is arbitrary; we could have assigned the indices in alphabetical order instead (or in any other order). If we estimate this model with the alphabetical indexing, we'll get a different $b$, and the difference between Coventry and Birmingham is still $b$, but now the difference between Birmingham and Glasgow is $-2b$, so it's not the same as between Coventry and Birmingham (well, unless $b$ happens to be exactly zero). So, this is an entirely different structure that has been imposed.

By using separate parameters (or columns) for each location, you allow each one to have its own, location-specific effect, rather than impose all sorts of restrictions. Usually, this is what you want.
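The relabeling argument above can be checked with a small sketch. It fits $I_i = a + b L_i$ by ordinary least squares to the four example rows from the question (far too few for a real analysis, used here only for illustration); the helper names `fit_line` and `implied_gap` are made up for this example.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


cities = ["London", "Glasgow", "Birmingham", "Coventry"]
income = [23000, 45000, 37000, 300]

# Two equally arbitrary integer codings of the same locations.
order_of_appearance = {"London": 1, "Glasgow": 2, "Birmingham": 3, "Coventry": 4}
alphabetical = {"Birmingham": 1, "Coventry": 2, "Glasgow": 3, "London": 4}


def implied_gap(coding, city_hi, city_lo):
    """Fitted income difference city_hi - city_lo, plus the fitted slope."""
    a, b = fit_line([coding[c] for c in cities], income)
    gap = (a + b * coding[city_hi]) - (a + b * coding[city_lo])
    return gap, b


gap_a, slope_a = implied_gap(order_of_appearance, "Birmingham", "Glasgow")
gap_b, slope_b = implied_gap(alphabetical, "Birmingham", "Glasgow")

print("order of appearance: slope", slope_a, "Birmingham - Glasgow", gap_a)
print("alphabetical:        slope", slope_b, "Birmingham - Glasgow", gap_b)
```

Under the first coding the Birmingham minus Glasgow gap equals the slope; under the alphabetical coding it equals minus twice the slope, and the slope itself changes. The data are identical in both cases; only the arbitrary labels differ.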

Chris Haug
7

Why can't we use something like this instead when we're doing a regression ?

You can, but you have to be aware of how to use it in your model.

In your example you assign integers to each distinct value of a categorical variable, like class or location. A major issue is that this implies an ordering of the data that is not necessarily meaningful.
For class an ordering like

  1. child
  2. adult
  3. senior

might work in the sense of an age order. In terms of age, the expression child < adult < senior (1 < 2 < 3) is true.
What about location? Consider these example expressions:

  • London > Glasgow
  • Birmingham < Coventry

While there are causal setups in which one could justify such orderings, they should be explained very explicitly and used cautiously.

I want to elaborate on this in the context of regression.
Consider a linear regression model of the above data of the form

$$\mathrm{income} = \beta_0 + \beta_1 \mathrm{class} + \beta_2 \mathrm{location}$$

Let's say that after fitting this model you inspect its parameters and find that the estimate of $\beta_1$ is 13000. All other things being equal, an increase of class by 1 unit therefore corresponds to an increase in income of 13000 units. With the ordering suggested above (child, adult, senior), this is interpretable. It would be much less so with the ordering you listed, 1=adult, 2=senior, 3=child.
Some categorical variables might not even have an identifiable ordering like this, in which case this way of incorporating the data in your model is confusing at best. This might just be the case with location.

However, such an integer mapping need not be discarded altogether; it just has to be used differently from the standard approach shown above for numeric variables.

You can use an integer mapping of these categorical variables, but in a particular way. Let's revisit the above regression model again, but in a form that uses the categorical variables differently:

$$\mathrm{income} = \beta_0 + \beta_{1,class} + \beta_{2,location}$$

Here $class$ and $location$ are the integer versions of your categorical features, and $\beta_1$ and $\beta_2$ are vectors that contain a separate parameter value for each distinct value of the respective variable. The interpretation shifts from a slope, as above, to more of a category-specific intercept: a change in income if, e.g., class=adult. In this case class and location act as index variables that select an entry of the corresponding vector.

Note that this is in principle the same as it would be with a dummy encoding. In the model

$$\mathrm{income} = \beta_0 + \beta_1 adult + \beta_2 senior + ...$$

where every distinct value of the categorical variables gets its own term in the model with its own coefficient (leaving out one of them), the coefficients take on the meaning of a relative change if, e.g., adult=1.
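The equivalence of the two forms can be sketched for the simplest case of a single categorical predictor, where the least-squares fit for each category is just that category's mean. This sketch makes two assumptions that are not in the text above: the intercept is absorbed into the parameter vector in the index form, and child is chosen as the left-out reference level in the dummy form. The data are the class and income columns of the question's example table.

```python
rows = [("adult", 23000), ("adult", 45000), ("senior", 37000), ("child", 300)]

# Integer mapping: the code of a level is its position in this list.
levels = ["child", "adult", "senior"]
index = {name: i for i, name in enumerate(levels)}

# Group means, which are the least-squares fits with one categorical predictor.
totals = [0.0] * len(levels)
counts = [0] * len(levels)
for name, y in rows:
    totals[index[name]] += y
    counts[index[name]] += 1
means = [t / c for t, c in zip(totals, counts)]

# Index form: the integer code only selects an entry of a parameter vector.
beta_class = means


def predict_index(name):
    return beta_class[index[name]]


# Dummy form: intercept for the reference level (child), plus one
# coefficient per remaining level, measured relative to the reference.
intercept = means[index["child"]]
dummy_coef = {name: means[index[name]] - intercept for name in levels[1:]}


def predict_dummy(name):
    return intercept + sum(
        coef * (1 if name == level else 0) for level, coef in dummy_coef.items()
    )


for name in levels:
    print(name, predict_index(name), predict_dummy(name))
```

Both forms give the same prediction for every category; they differ only in how the parameters are written down. Note that the integer codes are never multiplied by a coefficient here, which is what distinguishes this use from treating them as a numeric variable.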

deemel