I have a data set df that looks like this (with about 18 million rows):
Index A B C D E Target
-----------------------------------------------------------------------------------
1 1 0.006848103 1 0 0 1
2 1 0.003511620 0 0 0 0
3 0 0.008068291 0 0 1 1
4 1 0.020320609 0 1 0 0
5 0 0.012538248 0 0 1 0
6 0 0.006917654 0 0 0 1
7 1 0.015234864 0 0 1 0
8 1 0.007661562 0 1 0 0
9 1 0.036132621 0 0 0 0
10 0 0.009288480 0 0 1 0
Once I run glm on it I get the following summary of my model:
Call:
glm(formula = Target ~ ., family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2869 -0.4814 -0.4442 -0.4311 2.3110
Coefficients: (3 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.106539 0.132804 -15.862 < 2e-16 ***
A2 0.630222 0.331437 1.901 0.05724 .
A3 2.633972 1.044464 2.522 0.01167 *
A5 -11.957095 535.411224 -0.022 0.98218
A18 14.618893 535.411379 0.027 0.97822
A81 13.101479 535.412637 0.024 0.98048
B 2.809450 1.271685 2.209 0.02716 *
C1 0.317448 0.171471 1.851 0.06412 .
C2 -12.139551 535.411313 -0.023 0.98191
D1 -0.034547 0.153491 -0.225 0.82192
D2 0.001097 1.232556 0.001 0.99929
D14 NA NA NA NA
D18 NA NA NA NA
D80 NA NA NA NA
E1 -0.238323 0.143562 -1.660 0.09690 .
E2 -0.270229 0.627636 -0.431 0.66680
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6445.4 on 9756 degrees of freedom
Residual deviance: 6379.6 on 9736 degrees of freedom
AIC: 6421.6
Number of Fisher Scoring iterations: 12
Why does glm create new variables like this? What do A18 and A81 even mean?
EDIT: Added the structure of the data frame.
> str(grouped)
'data.frame': 9757 obs. of 10 variables:
$ A : Factor w/ 6 levels "1","2","3","15",..: 1 1 1 1 1 1 1 1 1 1 ...
$ B : num 0.00685 0.00351 0.00807 0.02032 0.01254 ...
$ C : Factor w/ 3 levels "0","1","2": 2 1 1 1 1 1 1 1 1 1 ...
$ D : Factor w/ 6 levels "0","1","2","14",..: 1 1 1 2 1 1 1 2 1 1 ...
$ E : Factor w/ 3 levels "0","1","2": 1 1 2 1 2 1 2 1 1 2 ...
$ Target: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
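For illustration, here is a minimal sketch on made-up data that reproduces the same kind of names. The `toy` data frame and its values are hypothetical and are not the real df; only the column types mirror the str output above. It suggests the names come from the factor columns: with R's default treatment contrasts, each non-baseline level of a factor gets its own 0/1 column named after the variable plus the level, so A18 would be the indicator for level "18" of A.

```r
# Hypothetical toy data: values are invented, only the types mirror str(grouped).
toy <- data.frame(
  A = factor(c(1, 2, 18, 81, 1, 2), levels = c(1, 2, 18, 81)),
  B = c(0.0068, 0.0035, 0.0081, 0.0203, 0.0125, 0.0069)
)

# model.matrix() builds the design matrix that glm() fits internally.
# With the default treatment contrasts, the first level ("1") is the baseline
# and every other level becomes an indicator column named <variable><level>.
X <- model.matrix(~ A + B, data = toy)

print(colnames(X))
# "(Intercept)" "A2" "A18" "A81" "B"
```

The numeric column B stays a single column, while the four-level factor A expands into three indicator columns.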
levels(df$A)? – user2974951 Nov 10 '21 at 13:55
str(dataset) it will make things easier to understand. – Guilherme Marthe Nov 10 '21 at 13:56