Simple linear regression with a numeric and a categorical predictor

Question

I am trying to fit a linear regression to these data:

data=data.frame(y=c(-1, -2, -3, -4, -5, -6, -7, -8, -9, -10, 1, 0.5, 0, -0.5, -1, -1.5, -2, -2.5, -3, -3.5, -4, -4.5), 
            sexe=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
            age= c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12))

There is a simple relationship between y and age once sexe is known.

I thought that glm would be able to "capture" this relationship, since a predictor can be categorical or numeric (see "Regression for categorical independent variables and a continuous dependent one").

Since the relationship is very simple, I would like to know whether I did something wrong, or whether glm is not suited to this problem.

Isabella Ghement · Accepted Answer · 2018-06-01T00:06:47.687

For a linear regression model relating y (continuous variable) to sexe and age, you would actually need to use the lm() function like so:

model1 <- lm(y ~ sexe + age, data = data) 
summary(model1)

The above model assumes that the effect of age on y is the same for both values of sexe. To fit a model which allows for the effect of age to be different across the two values of sexe, you can use this syntax:

model2 <- lm(y ~ sexe*age, data = data) 
summary(model2)
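As a small illustration of what the sexe*age syntax does, the sketch below checks that it is shorthand for both main effects plus their interaction. The data are the same as in the question, only written more compactly; the names model2b and same_fit are made up for this example.

```r
# Same data as in the question, written compactly
data <- data.frame(
  y    = c(-(1:10), seq(1, -4.5, by = -0.5)),
  sexe = rep(c(0, 1), times = c(10, 12)),
  age  = c(1:10, 1:12)
)

# sexe*age expands to sexe + age + sexe:age
model2  <- lm(y ~ sexe * age, data = data)
model2b <- lm(y ~ sexe + age + sexe:age, data = data)  # hypothetical name

# Both formulas should give the same four coefficients
coef(model2)
same_fit <- isTRUE(all.equal(coef(model2), coef(model2b)))
same_fit
```

The sexe:age coefficient is the amount by which the age slope changes when sexe goes from 0 to 1.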

To determine which of the two models is supported by your data, you can perform an ANOVA F-test:

anova(model1, model2)

If the p-value for this test is smaller than your pre-selected significance level alpha (e.g., alpha = 0.05), then the data provide evidence that the effect of age differs across values of sexe.
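If you want to apply that decision rule in code rather than reading it off the printed table, something like the following should work. It is only a sketch: cmp, p_value and alpha are names chosen for this example, and the data are rebuilt compactly so the snippet runs on its own.

```r
# Same data as in the question, written compactly
data <- data.frame(
  y    = c(-(1:10), seq(1, -4.5, by = -0.5)),
  sexe = rep(c(0, 1), times = c(10, 12)),
  age  = c(1:10, 1:12)
)

model1 <- lm(y ~ sexe + age, data = data)
model2 <- lm(y ~ sexe * age, data = data)

# anova() returns a data frame; row 2 compares model2 against model1
cmp     <- anova(model1, model2)
p_value <- cmp[2, "Pr(>F)"]

alpha <- 0.05          # pre-selected significance level
p_value < alpha        # TRUE means the interaction is supported
```

Because these particular data contain no noise, model2 fits essentially perfectly, so the F statistic is enormous and the p-value is effectively zero.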

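The glm() function is better suited for models where the outcome variable y is, for example, a count variable or a binary variable with values 0 and 1. Nominal or ordinal outcomes with more than two categories are usually fitted with other functions (for instance nnet::multinom() or MASS::polr()) rather than glm().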
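For what it's worth, glm() is not wrong for a continuous outcome: with the Gaussian family and its default identity link it fits the same model as lm(). The sketch below checks this on the question's data; fit_lm, fit_glm and same_coef are names invented for the example.

```r
# Same data as in the question, written compactly
data <- data.frame(
  y    = c(-(1:10), seq(1, -4.5, by = -0.5)),
  sexe = rep(c(0, 1), times = c(10, 12)),
  age  = c(1:10, 1:12)
)

# Ordinary least squares
fit_lm  <- lm(y ~ sexe + age, data = data)

# Gaussian GLM with identity link: same model, different fitting routine
fit_glm <- glm(y ~ sexe + age, family = gaussian(), data = data)

# The coefficient estimates should agree up to numerical tolerance
same_coef <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))
same_coef
```

So if a glm() fit fails to capture the relationship, the formula (for example a missing interaction term) is a more likely cause than the choice of function.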

Addendum:

When you fit each of the two models described above, model1 and model2, it's not a bad idea for you to check the variance inflation factor (vif) for each term in the model.

install.packages("car")
library(car) 

vif(model1)
vif(model2)

When you do so, here is what you get for model1:

> vif(model1)
    sexe      age 
1.024189 1.024189

and for model2:

> vif(model2)
    sexe      age sexe:age 
4.611570 2.799449 7.138292 
Warning message:
In summary.lm(object) : essentially perfect fit: summary may be unreliable

Notice the warning posted by R, which suggests that the summary reported for model2 may be unreliable, and also the large vif for the interaction term sexe:age in model2. You might have to discard model2 and stick with model1 for these data, even though the p-value corresponding to the ANOVA F-test is statistically significant.
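To see where the model1 figures come from without installing car, here is a base-R sketch of the usual definition of the VIF: regress one predictor on the others and take 1 / (1 - R^2). The names aux_r2 and vif_age are made up for this example.

```r
# Same data as in the question, written compactly
data <- data.frame(
  y    = c(-(1:10), seq(1, -4.5, by = -0.5)),
  sexe = rep(c(0, 1), times = c(10, 12)),
  age  = c(1:10, 1:12)
)

# Auxiliary regression: how well do the other predictors explain age?
aux_r2 <- summary(lm(age ~ sexe, data = data))$r.squared

# VIF for age in model1 (y ~ sexe + age)
vif_age <- 1 / (1 - aux_r2)
vif_age
```

This should reproduce the 1.024189 reported for model1. With only two predictors the auxiliary R^2 is the same in both directions, which is why sexe and age share the same VIF there.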

Ignoring the issues with model2 for now, here's a quick way to get the plot produced by Sal in his answer:

install.packages("sjPlot")
install.packages("sjmisc")


library(sjPlot)
library(sjmisc)

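# Note: later sjPlot releases dropped sjp.int(); plot_model(model2, type = "int") is the replacement there
sjp.int(model2, type = "eff")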

You can also get better-formatted output for your models using these commands (in later sjPlot releases, tab_model() takes the place of sjt.lm()):

sjt.lm(model1)
sjt.lm(model2)

sjt.lm(model1, model2, 
      depvar.labels = c("y", "y"))

Thank you @Isabella. What if I have more variables, either categorical or numeric, with more complex relationships between them? Can we use y ~ sexe*age*var1*var2, or does it make sense to test y ~ sexe*age*(var1+var2)? Also, do you know whether other algorithms, such as randomForest or GBM, can be used? — John Smith, Jun 03 '18 at 08:52

score 2 · Answer 2 · answered May 31 '18 at 17:18

This answer is a follow-up to the one from @IsabellaGhement.

The output from summary(model2) contains the information necessary to determine the best-fit line for each value of sexe. But it is not always obvious how to translate this information into the intercepts and slopes of the individual lines.

Below, I1 and I2 represent the intercepts, and B1 and B2 represent the slopes. For clarity, I copied the numbers as they appeared in my summary(model2) output.

Note: In the plot command, I had to use data$sexe + 1 because the color for 0 would be white.

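# Coefficients copied from summary(model2)
I.nought = 3.409e-15
I1 = I.nought
I2 = I.nought + 1.500e+00
B1  = -1.000e+00
# Slope for sexe = 1 is the age slope plus the sexe:age coefficient (a sum, not a product)
B2  = -1.000e+00 + 5.000e-01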

plot(x   = data$age,
     y   = data$y,
     col = data$sexe + 1,
     pch = 16,
     xlab = "age",
     ylab = "y")

legend('bottomleft',
       legend = levels(factor(data$sexe)),
       col = 1:2,
       cex = 1,   
       pch = 16)

abline(I1, B1,
       lty=1, lwd=2, col = 1)

abline(I2, B2,
       lty=1, lwd=2, col = 2)
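Rather than copying numbers from the printed summary, the same intercepts and slopes can be pulled from coef(model2). This is just a sketch of that idea on the question's data; b is a name made up here, and the coefficient names assume sexe is stored as a numeric 0/1 column, as in the question.

```r
# Same data as in the question, written compactly
data <- data.frame(
  y    = c(-(1:10), seq(1, -4.5, by = -0.5)),
  sexe = rep(c(0, 1), times = c(10, 12)),
  age  = c(1:10, 1:12)
)

model2 <- lm(y ~ sexe * age, data = data)
b <- coef(model2)

# Line for sexe = 0: baseline intercept and slope
I1 <- b[["(Intercept)"]]
B1 <- b[["age"]]

# Line for sexe = 1: add the sexe shift and the interaction term
I2 <- b[["(Intercept)"]] + b[["sexe"]]
B2 <- b[["age"]] + b[["sexe:age"]]

round(c(I1 = I1, I2 = I2, B1 = B1, B2 = B2), 6)
```

These values can be passed to abline() exactly as above, and they stay correct if the data change.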

Great follow-up to my post, Sal! Thank you! — Isabella Ghement, May 31 '18 at 22:18
