
Preparation

R libraries used: library(dplyr), plus library(forcats) for fct_rev()


The situation

Data

Given the data

my_data <- mtcars |>
 mutate(vs = factor(vs,
                    levels = c(0, 1),
                    labels = c('V-shaped', 'straight'))) |>
 select(mpg, hp, vs)

which results in

> my_data
                     mpg  hp       vs
Mazda RX4           21.0 110 V-shaped
Mazda RX4 Wag       21.0 110 V-shaped
Datsun 710          22.8  93 straight
Hornet 4 Drive      21.4 110 straight
Hornet Sportabout   18.7 175 V-shaped
Valiant             18.1 105 straight
Duster 360          14.3 245 V-shaped
Merc 240D           24.4  62 straight
Merc 230            22.8  95 straight
Merc 280            19.2 123 straight
Merc 280C           17.8 123 straight
Merc 450SE          16.4 180 V-shaped
Merc 450SL          17.3 180 V-shaped
...

Note that vs has exactly two values, V-shaped and straight.

Regressions

The estimate of hp for the groups V-shaped and straight can be calculated separately:

reg_v <- my_data |>
 filter(vs == 'V-shaped') |>
 lm(mpg ~ hp, data = _)

giving

> summary(reg_v)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.49637    2.42004  10.122 2.32e-08 ***
hp          -0.04153    0.01219  -3.408   0.0036 **

and

reg_st <- my_data |>
 filter(vs == 'straight') |>
 lm(mpg ~ hp, data = _)

giving

> summary(reg_st)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.00055    4.17535   9.341 7.45e-07 ***
hp          -0.15810    0.04426  -3.572  0.00384 **

We can also specify an interaction:

reg_inter1 <- my_data |>
 lm(mpg ~ hp*vs, data = _)

returning

> summary(reg_inter1)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   24.49637    2.73893   8.944 1.07e-09 ***
hp            -0.04153    0.01379  -3.011  0.00547 **
vsstraight    14.50418    4.58160   3.166  0.00371 **
hp:vsstraight -0.11657    0.04130  -2.822  0.00868 **

When reversing the order of the factor levels

reg_inter2 <- my_data |>
 mutate(vs = fct_rev(vs)) |>
 lm(mpg ~ hp*vs, data = _)

the following is returned

> summary(reg_inter2)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    39.00055    3.67278  10.619 2.52e-11 ***
hp             -0.15810    0.03893  -4.061 0.000357 ***
vsV-shaped    -14.50418    4.58160  -3.166 0.003713 **
hp:vsV-shaped   0.11657    0.04130   2.822 0.008677 **

It is also possible to use

contrasts(my_data$vs) <- contr.sum(2)

to yield

> contrasts(my_data$vs)
         [,1]
V-shaped    1
straight   -1

Running the regression

reg_inter3 <- my_data |>
 lm(mpg ~ hp*vs, data = _)

then returns

> summary(reg_inter3)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.74846    2.29080  13.859 4.64e-14 ***
hp          -0.09982    0.02065  -4.833 4.37e-05 ***
vs1         -7.25209    2.29080  -3.166  0.00371 **
hp:vs1       0.05828    0.02065   2.822  0.00868 **


The questions

Question 1

Is it mathematically correct that the estimate of hp for each group can be calculated by deleting the other group's data, as in reg_v and reg_st?

Question 2

Suppose I only calculated the following regression:

reg_inter1 <- my_data |>
 lm(mpg ~ hp*vs, data = _)

returning

> summary(reg_inter1)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   24.49637    2.73893   8.944 1.07e-09 ***
hp            -0.04153    0.01379  -3.011  0.00547 **
vsstraight    14.50418    4.58160   3.166  0.00371 **
hp:vsstraight -0.11657    0.04130  -2.822  0.00868 **

Here the hp coefficient is the estimate for the reference group (V-shaped). One can easily calculate the estimate of hp for the other group (straight) by

-0.04153 + (-0.11657) = -0.15810

How can I calculate the std. error, t value, p value, and possibly a CI or other statistics for straight based on this regression?

Question 3 (Mainly a reformulation of question 2 in my eyes)

Given these results

> summary(reg_inter3)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.74846    2.29080  13.859 4.64e-14 ***
hp          -0.09982    0.02065  -4.833 4.37e-05 ***
vs1         -7.25209    2.29080  -3.166  0.00371 **
hp:vs1       0.05828    0.02065   2.822  0.00868 **

I can state that hp is a significant ($p = 4.37e-05$) predictor with $\beta = -0.09982$ for all cars taken together.

I can also calculate

$$\beta_{\text{V-shaped}} = -0.09982 + 1 \cdot 0.05828 = -0.04154$$

and

$$\beta_{\text{straight}} = -0.09982 + (-1) \cdot 0.05828 = -0.1581$$

for the two subsamples each.

How do I check whether these $\beta$ values are significant? Do I have to run the two additional regressions reg_v and reg_st presented at the very beginning, i.e., delete all cars that are built with the opposite engine?

user1
  • Because it's unclear what "dummy manifestation" means and since all SEs, t-values, p-values, etc. are immediately accessible through the output of summary, I am reluctant even to guess what you are trying to ask. Could you clarify? – whuber Nov 08 '23 at 19:17
  • @whuber Thanks for your reply! I updated the question, does this clarify what I am meaning? – user1 Nov 08 '23 at 19:24
  • Thank you. It looks like this is an example of testing a "contrast." Take a look at our posts on this topic. For instance, this is a good answer: https://stats.stackexchange.com/a/463143/919. – whuber Nov 08 '23 at 19:26
  • @whuber I am afraid I don't get where you are pointing me to. There are 460 results and I struggle to judge the relevance of the several posts. Do you suggest using contrast coding (V-shaped=-1, straight=1) instead of dummy coding (V-shaped=0, straight=1), which would result in two interaction terms in the model? – user1 Nov 08 '23 at 19:50
  • @whuber I read the answer you suggested and some textbooks and an encyclopaedia on the topic of "contrast analysis". I am sure that I did not get everything, but I do not see any answer to my problem. I added more information above. Note that vs has exactly two values, V-shaped and straight, which is kind of an edge case and not explicitly handled in my sources. I was always able to compute the different $\beta$ values but not the $p$ values, and after reading everything I still cannot. I would really appreciate any help. – user1 Nov 09 '23 at 11:27

1 Answer


Question 1

Although you might get the same point estimates within each vs group by deleting the cases in the other vs group, you do not get the same standard errors for those estimates. Compare, for example, the standard error of the hp coefficient in the reg_st model (0.04426) with that in the reg_inter2 model (0.03893), where straight is the reference level and the hp coefficient is therefore the same slope for the straight group. Those standard errors are important for attempts at inference. Also, note that this deletion approach isn't applicable to circumstances with multiple continuous predictors.
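
As a rough illustration of that difference, the sketch below fits the straight-only subset model and an interaction model with straight as the reference level, then puts the two hp rows side by side. It uses base R only; the names vs_rev, reg_ref_st and slope_cmp are made up for this example. The interaction model estimates a single residual variance from all 32 cars (28 residual degrees of freedom), while the subset model uses only the straight-engine cars (12 residual degrees of freedom), which is one way to see why the standard errors differ even though the slopes agree.

```r
# Rebuild the question's data with base R only (no dplyr needed here).
my_data <- transform(
  mtcars[, c("mpg", "hp", "vs")],
  vs = factor(vs, levels = c(0, 1), labels = c("V-shaped", "straight"))
)

# Subset model: only the straight-engine cars.
reg_st <- lm(mpg ~ hp, data = subset(my_data, vs == "straight"))

# Interaction model with 'straight' as the reference level, so the plain
# hp coefficient is the slope within the straight group.
# vs_rev is a hypothetical helper column introduced for this sketch.
my_data$vs_rev <- relevel(my_data$vs, ref = "straight")
reg_ref_st <- lm(mpg ~ hp * vs_rev, data = my_data)

# Same slope estimate, different standard errors.
slope_cmp <- rbind(
  subset_model      = coef(summary(reg_st))["hp", c("Estimate", "Std. Error")],
  interaction_model = coef(summary(reg_ref_st))["hp", c("Estimate", "Std. Error")]
)
print(slope_cmp)

# Residual standard deviation and residual degrees of freedom of each fit.
print(c(subset = sigma(reg_st), interaction = sigma(reg_ref_st)))
print(c(subset = df.residual(reg_st), interaction = df.residual(reg_ref_st)))
```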

Question 2

You need to apply the formula for the variance of a weighted sum of correlated variables to get the standard errors for those estimates. That's because each of the coefficient estimates is potentially correlated with the other estimates. That calculation requires the coefficient variance-covariance matrix, which isn't usually shown in the printed report of a regression model. In R, the vcov() function applied to the model provides that matrix. It can be good to work through some simple examples to learn what's going on at first, but in general you are less likely to make mistakes if you use well-vetted post-modeling tools like the R emmeans or car packages.
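
For a simple worked example, here is a minimal sketch of that calculation for the straight-group slope in reg_inter1, using only base R. For a sum of two coefficients the formula reduces to Var(b1 + b2) = Var(b1) + Var(b2) + 2 Cov(b1, b2); the code writes it in the general form w' V w. The names w, est, se, t_val, p_val and ci are chosen for this illustration only.

```r
# Rebuild the question's data with base R only.
my_data <- transform(
  mtcars[, c("mpg", "hp", "vs")],
  vs = factor(vs, levels = c(0, 1), labels = c("V-shaped", "straight"))
)
reg_inter1 <- lm(mpg ~ hp * vs, data = my_data)

b <- coef(reg_inter1)   # coefficient estimates
V <- vcov(reg_inter1)   # their variance-covariance matrix

# Weights defining the linear combination: slope for 'straight' is
# 1 * hp + 1 * hp:vsstraight.
w <- c("hp" = 1, "hp:vsstraight" = 1)

est <- sum(w * b[names(w)])
# Variance of a weighted sum of correlated estimates: w' V w.
se <- sqrt(drop(w %*% V[names(w), names(w)] %*% w))

df_res <- df.residual(reg_inter1)
t_val  <- est / se
p_val  <- 2 * pt(abs(t_val), df = df_res, lower.tail = FALSE)
ci     <- est + c(-1, 1) * qt(0.975, df = df_res) * se   # 95% CI

print(c(estimate = est, se = se, t = t_val, p = p_val))
print(ci)
```

The result should agree with the hp row of reg_inter2 in the question, since that model is the same fit with straight as the reference level.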

Question 3

For the overall significance of a predictor like hp that's involved in an interaction, you should use a "chunk" test including all coefficients involving hp. That's not what the hp coefficient in reg_inter3 represents. That hp coefficient is for a hypothetical intermediate value of vs that is between straight and V-shaped. That model has a statistically significant hp:vs1 interaction coefficient, which you will notice has the same p-value as the interaction coefficients in all of the other models.
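
One common way to run such a chunk test, sketched here with base R only, is to compare the full model against a reduced model that drops every term involving hp. The object names full, reduced, chunk, my_sum and mean_slope are made up for this example. The last lines also check one reading of the reg_inter3 hp coefficient: with contr.sum(2) coding it equals the unweighted mean of the two group slopes.

```r
# Rebuild the question's data with base R only.
my_data <- transform(
  mtcars[, c("mpg", "hp", "vs")],
  vs = factor(vs, levels = c(0, 1), labels = c("V-shaped", "straight"))
)

full    <- lm(mpg ~ hp * vs, data = my_data)  # hp, vs and hp:vs
reduced <- lm(mpg ~ vs, data = my_data)       # drops hp and hp:vs

# F-test on 2 numerator degrees of freedom: both hp terms jointly.
chunk <- anova(reduced, full)
print(chunk)

# Sum-to-zero coding, as in reg_inter3 of the question.
my_sum <- my_data
contrasts(my_sum$vs) <- contr.sum(2)
reg_inter3 <- lm(mpg ~ hp * vs, data = my_sum)

# The hp coefficient under this coding is the average of the two slopes.
b1 <- coef(full)
mean_slope <- mean(c(b1[["hp"]], b1[["hp"]] + b1[["hp:vsstraight"]]))
print(all.equal(unname(coef(reg_inter3)["hp"]), mean_slope))
```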

You certainly do not need to run separate models for each value of a dichotomous predictor like vs. The emmeans and car packages, among others, provide tools for evaluating the predicted outcomes at any combinations of predictor values. With such tools, there's no need to move from the default treatment contrasts in R to the deviation contrasts that you got with contr.sum().
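
As one possible illustration with emmeans (assuming the package is installed), emtrends() reports the hp slope within each level of vs from the single interaction model, and test(), confint() and pairs() then give the t-tests, confidence intervals and the difference between the slopes. This is a sketch of one workflow, not the only way to do it; the object name slopes is arbitrary.

```r
library(emmeans)

# Rebuild the question's data with base R only.
my_data <- transform(
  mtcars[, c("mpg", "hp", "vs")],
  vs = factor(vs, levels = c(0, 1), labels = c("V-shaped", "straight"))
)
reg_inter1 <- lm(mpg ~ hp * vs, data = my_data)

# Slope of mpg with respect to hp, separately for each vs group.
slopes <- emtrends(reg_inter1, ~ vs, var = "hp")

print(test(slopes))     # estimate, SE, df, t ratio and p value per group
print(confint(slopes))  # confidence intervals for each slope
print(pairs(slopes))    # difference between the two slopes
```

The difference reported by pairs() corresponds to the interaction coefficient, so its p-value should match the one shown for the interaction term in the regression summaries.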

EdM