
I have a data frame df that looks like this (with about 18 million rows):

    Index         A      B                       C       D             E        Target
-----------------------------------------------------------------------------------
    1             1      0.006848103             1       0             0        1
    2             1      0.003511620             0       0             0        0
    3             0      0.008068291             0       0             1        1
    4             1      0.020320609             0       1             0        0
    5             0      0.012538248             0       0             1        0
    6             0      0.006917654             0       0             0        1
    7             1      0.015234864             0       0             1        0
    8             1      0.007661562             0       1             0        0
    9             1      0.036132621             0       0             0        0
    10            0      0.009288480             0       0             1        0

Once I run glm on it, I get the following summary of my model:

    Call:
    glm(formula = Target ~ ., family = "binomial", data = df)

    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -1.2869  -0.4814  -0.4442  -0.4311   2.3110  

    Coefficients: (3 not defined because of singularities)
                  Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  -2.106539   0.132804 -15.862  < 2e-16 ***
    A2            0.630222   0.331437   1.901  0.05724 .  
    A3            2.633972   1.044464   2.522  0.01167 *  
    A5          -11.957095 535.411224  -0.022  0.98218    
    A18          14.618893 535.411379   0.027  0.97822    
    A81          13.101479 535.412637   0.024  0.98048    
    B             2.809450   1.271685   2.209  0.02716 *  
    C1            0.317448   0.171471   1.851  0.06412 .  
    C2          -12.139551 535.411313  -0.023  0.98191    
    D1           -0.034547   0.153491  -0.225  0.82192    
    D2            0.001097   1.232556   0.001  0.99929    
    D14                 NA         NA      NA       NA    
    D18                 NA         NA      NA       NA    
    D80                 NA         NA      NA       NA    
    E1           -0.238323   0.143562  -1.660  0.09690 .  
    E2           -0.270229   0.627636  -0.431  0.66680    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 6445.4  on 9756  degrees of freedom
    Residual deviance: 6379.6  on 9736  degrees of freedom
    AIC: 6421.6

    Number of Fisher Scoring iterations: 12

Why does glm create new variables like this? What do A18 and A81 even mean?

EDIT: Added the structure of the data frame.

> str(grouped)
'data.frame':   9757 obs. of  10 variables:
 $ A     : Factor w/ 6 levels "1","2","3","15",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ B     : num  0.00685 0.00351 0.00807 0.02032 0.01254 ...
 $ C     : Factor w/ 3 levels "0","1","2": 2 1 1 1 1 1 1 1 1 1 ...
 $ D     : Factor w/ 6 levels "0","1","2","14",..: 1 1 1 2 1 1 1 2 1 1 ...
 $ E     : Factor w/ 3 levels "0","1","2": 1 1 2 1 2 1 2 1 1 2 ...
 $ Target: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
  • Is your A variable a factor? – user2974951 Nov 10 '21 at 13:50
  • @user2974951 - Yes, with 6 levels. – Parseval Nov 10 '21 at 13:54
  • What are its levels? levels(df$A)? – user2974951 Nov 10 '21 at 13:55
  • Can you please share str(dataset)? It will make things easier to understand. – Guilherme Marthe Nov 10 '21 at 13:56
  • @user2974951 - Ok, I see where you're going with this. The levels are 2, 3, 5, 18 and 81, indeed just like the suffixes. But do I really want this? – Parseval Nov 10 '21 at 13:57
  • @GuilhermeMarthe - Done! – Parseval Nov 10 '21 at 14:00
  • You have 18 million rows and use this for glm? – Sextus Empiricus Nov 10 '21 at 14:06
  • I'm quite new to this. I have a very sparse data set (many zeroes) where the target is binary (0/1) and the predictors are numerical, both discrete and continuous, on very different scales. What would be a good algorithm in R for classification here? What would you suggest? – Parseval Nov 10 '21 at 14:10
  • @Parseval it starts with exploring the data and using an understanding of what the data means. For this reason, if you ask another person what to do with the data, then you need to explain it. Typically, if you have sparse data then you cannot just treat the absence of data as a value of zero. If you want to treat the variables as continuous, then add a dummy variable for the absence of data; otherwise absent data effectively acts like a zero (a sketch of this follows these comments). But to go more in-depth into the topic it is better to know more about the data. – Sextus Empiricus Nov 10 '21 at 14:33
  • The target variable is whether the customer has made a purchase in a certain product category given the predictors, where A, C, D, E are whether the customer previously placed at least one order in categories A, C, D, E. The variable B is simply the total amount of money the customer spent in the last year; note that this variable is scaled to lie between 0 and 1. So naturally, the zeroes are not actually missing values; they mean something. – Parseval Nov 10 '21 at 14:40
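
As a minimal sketch of the suggestion in the comment above (purely hypothetical here, since the asker clarifies that a zero in B is a real value and not missing data; B_missing is an invented helper column, not part of the original data):

    # Hypothetical sketch: flag rows where B carries no information, so that
    # the absence of data is not forced to act like a spend of exactly 0.
    df$B_missing <- as.integer(df$B == 0)
    fit <- glm(Target ~ A + B + B_missing + C + D + E,
               family = "binomial", data = df)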

1 Answer


Why does glm create new variables like this? What do A18 and A81 even mean?

This occurs when the variables are interpreted as categorical data. The glm procedure converts each categorical variable into a set of 0/1 dummy variables, one for every level except the reference level (also known as dummy or treatment coding), and names each one by pasting the variable name onto the level. So A18 is the indicator for rows where A has the level 18, and A81 for rows where A has the level 81.
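
As a minimal sketch (the levels below are copied from the question; everything else is illustrative), model.matrix() shows the design matrix that glm() builds from such a factor:

    # A toy factor with the same levels as the A column in the question.
    A <- factor(c(1, 2, 3, 5, 18, 81))

    # glm() expands a factor via model.matrix(): every level except the first
    # (the reference level, here "1") gets its own 0/1 dummy column, named by
    # pasting the variable name and the level together.
    model.matrix(~ A)
    #   (Intercept) A2 A3 A5 A18 A81
    # 1           1  0  0  0   0   0
    # 2           1  1  0  0   0   0
    # 3           1  0  1  0   0   0
    # 4           1  0  0  1   0   0
    # 5           1  0  0  0   1   0
    # 6           1  0  0  0   0   1
    # (attributes omitted)

The column names A2, A3, A5, A18 and A81 match the coefficient names in the summary above; the reference level 1 is absorbed into the intercept.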

Maybe those variables are supposed to be categorical. But if not, then the 'error' might have occurred at data input. R can interpret a variable as categorical while reading the data from a file: if a column contains anything that cannot be parsed as a number, the whole column is read as text rather than as a continuous variable, and (with stringsAsFactors = TRUE, which was the default before R 4.0) it becomes a factor. If a column contains only numbers, it is read as numeric.
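
For example (a made-up inline file, just to illustrate the failure mode): a single non-numeric token is enough to make the whole column categorical, and converting back has to go through character:

    # One stray non-numeric value ("oops") keeps the column from being numeric.
    txt <- "x\n1\n2\noops\n5"
    d <- read.csv(text = txt, stringsAsFactors = TRUE)
    str(d$x)
    # Factor w/ 4 levels "1","2","5","oops": 1 2 4 3

    # Convert via character; as.numeric() on the factor itself would return
    # the internal level codes (1, 2, 3, ...) rather than the original values.
    as.numeric(as.character(d$x))
    # [1]  1  2  5 NA   (plus a coercion warning for "oops")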

  • Well, they were numerical first, but then I turned them into factors since I only had 0 and 1. When I ran glm with the predictors as numeric I got some rank-deficient error, and I assume it has to do with collinearity or something. Maybe I should just start a new question and state the original problem. – Parseval Nov 10 '21 at 14:11
  • The data at university was much nicer than real-world data. Nothing works when dealing with real-world data, it seems. – Parseval Nov 10 '21 at 14:12