
I have a data frame df that looks like this (with about 18 million rows):

    Index         A      B                       C       D             E        Target
-----------------------------------------------------------------------------------
    1             1      0.006848103             1       0             0        1
    2             1      0.003511620             0       0             0        0
    3             0      0.008068291             0       0             1        1
    4             1      0.020320609             0       1             0        0
    5             0      0.012538248             0       0             1        0
    6             0      0.006917654             0       0             0        1
    7             1      0.015234864             0       0             1        0
    8             1      0.007661562             0       1             0        0
    9             1      0.036132621             0       0             0        0
    10            0      0.009288480             0       0             1        0

Once I run glm on it, I get the following summary of my model:

    Call:
    glm(formula = Target ~ ., family = "binomial", data = df)

    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -1.2869  -0.4814  -0.4442  -0.4311   2.3110  

    Coefficients: (3 not defined because of singularities)
                  Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  -2.106539   0.132804 -15.862  < 2e-16 ***
    A2            0.630222   0.331437   1.901  0.05724 .  
    A3            2.633972   1.044464   2.522  0.01167 *  
    A5          -11.957095 535.411224  -0.022  0.98218    
    A18          14.618893 535.411379   0.027  0.97822    
    A81          13.101479 535.412637   0.024  0.98048    
    B             2.809450   1.271685   2.209  0.02716 *  
    C1            0.317448   0.171471   1.851  0.06412 .  
    C2          -12.139551 535.411313  -0.023  0.98191    
    D1           -0.034547   0.153491  -0.225  0.82192    
    D2            0.001097   1.232556   0.001  0.99929    
    D14                 NA         NA      NA       NA    
    D18                 NA         NA      NA       NA    
    D80                 NA         NA      NA       NA    
    E1           -0.238323   0.143562  -1.660  0.09690 .  
    E2           -0.270229   0.627636  -0.431  0.66680    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 6445.4  on 9756  degrees of freedom
    Residual deviance: 6379.6  on 9736  degrees of freedom
    AIC: 6421.6

    Number of Fisher Scoring iterations: 12

Why does glm create new variables like this? What do A18 and A81 even mean?

EDIT: Added the structure of the data frame.

> str(grouped)
'data.frame':   9757 obs. of  10 variables:
 $ A     : Factor w/ 6 levels "1","2","3","15",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ B     : num  0.00685 0.00351 0.00807 0.02032 0.01254 ...
 $ C     : Factor w/ 3 levels "0","1","2": 2 1 1 1 1 1 1 1 1 1 ...
 $ D     : Factor w/ 6 levels "0","1","2","14",..: 1 1 1 2 1 1 1 2 1 1 ...
 $ E     : Factor w/ 3 levels "0","1","2": 1 1 2 1 2 1 2 1 1 2 ...
 $ Target: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
  • Is your A variable a factor? – user2974951 Nov 10 '21 at 13:50
  • @user2974951 - Yes, with 6 levels. – Parseval Nov 10 '21 at 13:54
  • What are its levels? levels(df$A)? – user2974951 Nov 10 '21 at 13:55
  • Can you please share str(dataset)? It will make things easier to understand. – Guilherme Marthe Nov 10 '21 at 13:56
  • @user2974951 - Ok, I see where you're going with this. The levels are 2, 3, 5, 18 and 81, indeed just like the suffixes. But do I really want this? – Parseval Nov 10 '21 at 13:57
  • @GuilhermeMarthe - Done! – Parseval Nov 10 '21 at 14:00
  • You have 18 million rows and use this for glm? – Sextus Empiricus Nov 10 '21 at 14:06
  • I'm quite new to this. I have a very sparse data set (many zeroes) where the target is binary (0/1) and the predictors are numerical, both discrete and continuous, on very different scales. What would be a good algorithm in R for classification here? What would you suggest? – Parseval Nov 10 '21 at 14:10
  • @Parseval it starts with exploring the data and using an understanding of what the data means. For this reason, if you ask another person what to do with the data, then you need to explain it. Typically, if you have sparse data then you cannot just treat the absence of data as a value of zero. If you want to treat the variables as continuous, then add a dummy variable for the absence of data; otherwise absent data effectively acts like a zero (a sketch of this follows these comments). But to go more in-depth into the topic it is better to know more about the data. – Sextus Empiricus Nov 10 '21 at 14:33
  • The target variable is whether the customer has made a purchase in a certain product category given the predictors, where A, C, D, E are whether the customer previously placed at least one order in categories A, C, D, E. The variable B is simply the total amount of money the customer spent in the last year; note that this variable is scaled to lie between 0 and 1. So naturally, the zeroes are not actually missing values; they mean something. – Parseval Nov 10 '21 at 14:40
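
As a minimal sketch of the suggestion in the comment above (purely hypothetical here, since the asker clarifies that a zero in B is a real value and not missing data; B_missing is an invented helper column, not part of the original data):

    # Hypothetical sketch: flag rows where B carries no information, so that
    # the absence of data is not forced to act like a spend of exactly 0.
    df$B_missing <- as.integer(df$B == 0)
    fit <- glm(Target ~ A + B + B_missing + C + D + E,
               family = "binomial", data = df)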

1 Answer


Why does glm create new variables like this? What do A18 and A81 even mean?

This occurs when the variables are interpreted as categorical data. The glm procedure converts each categorical variable into a set of 0/1 dummy variables, one for every level except the reference level (also known as dummy or treatment coding), and names each one by pasting the variable name onto the level. So A18 is the indicator for rows where A has the level 18, and A81 for rows where A has the level 81.
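
As a minimal sketch (the levels below are copied from the question; everything else is illustrative), model.matrix() shows the design matrix that glm() builds from such a factor:

    # A toy factor with the same levels as the A column in the question.
    A <- factor(c(1, 2, 3, 5, 18, 81))

    # glm() expands a factor via model.matrix(): every level except the first
    # (the reference level, here "1") gets its own 0/1 dummy column, named by
    # pasting the variable name and the level together.
    model.matrix(~ A)
    #   (Intercept) A2 A3 A5 A18 A81
    # 1           1  0  0  0   0   0
    # 2           1  1  0  0   0   0
    # 3           1  0  1  0   0   0
    # 4           1  0  0  1   0   0
    # 5           1  0  0  0   1   0
    # 6           1  0  0  0   0   1
    # (attributes omitted)

The column names A2, A3, A5, A18 and A81 match the coefficient names in the summary above; the reference level 1 is absorbed into the intercept.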

Maybe those variables are supposed to be categorical. But if not, then the 'error' might have occurred at data input. R can interpret a variable as categorical while reading the data from a file: if a column contains anything that cannot be parsed as a number, the whole column is read as text rather than as a continuous variable, and (with stringsAsFactors = TRUE, which was the default before R 4.0) it becomes a factor. If a column contains only numbers, it is read as numeric.
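
For example (a made-up inline file, just to illustrate the failure mode): a single non-numeric token is enough to make the whole column categorical, and converting back has to go through character:

    # One stray non-numeric value ("oops") keeps the column from being numeric.
    txt <- "x\n1\n2\noops\n5"
    d <- read.csv(text = txt, stringsAsFactors = TRUE)
    str(d$x)
    # Factor w/ 4 levels "1","2","5","oops": 1 2 4 3

    # Convert via character; as.numeric() on the factor itself would return
    # the internal level codes (1, 2, 3, ...) rather than the original values.
    as.numeric(as.character(d$x))
    # [1]  1  2  5 NA   (plus a coercion warning for "oops")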

  • Well, they were numerical first, but then I turned them into factors since I only had 0 and 1. When I ran glm with the predictors as numeric I got some rank-deficient error, and I assume it has to do with collinearity or something. Maybe I should just start a new question and state the original problem. – Parseval Nov 10 '21 at 14:11
  • The data at university was much nicer than real-world data. Nothing works when dealing with real-world data, it seems. – Parseval Nov 10 '21 at 14:12