
I have been contemplating which test or statistical method to implement in order to check that the variables I use throughout my analysis actually have the predicted relationship with the dependent variable. To preface this: I do not actually run any regressions at all.

I was thinking of either setting up a correlation table or performing a regression, to be able to see the signs. But I am not really sure what the intuitive difference between the two approaches is; i.e. if the correlation sign is +, will the coefficient in a regression always be positive as well? Sorry if this is an entry-level question, but it got me thinking about what the conceptual difference is (besides the fact that you can use the regression coefficients to predict).

Philip

1 Answer


I was thinking of either setting up a correlation table or performing a regression, to be able to see the signs. But I am not really sure what the intuitive difference between the two approaches is; i.e. if the correlation sign is +, will the coefficient in a regression always be positive as well?

Yes, provided that the regression model is a simple linear regression involving only the same two variables that you used to compute the correlation coefficient, and no others. This is because the regression coefficient is simply the (Pearson) correlation coefficient multiplied by the ratio of the standard deviations of the two variables. Since this ratio is always positive, it follows that the correlation coefficient and the regression coefficient will have the same sign.
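Written out as a formula (with $r_{XY}$ denoting the Pearson correlation and $s_X$, $s_Y$ the sample standard deviations; this notation is only added here for clarity), the slope from regressing $Y$ on $X$ is

$$\hat{\beta} \;=\; r_{XY}\,\frac{s_Y}{s_X}.$$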

The difference between the correlation coefficient and the regression coefficient is that correlation measures the strength of a linear relationship, while regression quantifies it further by telling you how much one variable changes, on average, for a one-unit change in the other. Thus, the regression model computes a line of best fit, giving a slope and an intercept. Please refer to this question and its answers for much more detail on the similarities and differences between correlation and regression:
What's the difference between correlation and simple linear regression?
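To make that distinction concrete, here is a minimal sketch with made-up numbers (my own illustration, not taken from the linked answer): cor() gives a single number describing the strength and direction of the linear association, while lm() gives an intercept and a slope, i.e. a fitted line that can also be used for prediction.

x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.9)

cor(x, y)                        # one number: strength and direction of the linear association
fit <- lm(y ~ x)
coef(fit)                        # two numbers: intercept and slope of the best-fitting line
predict(fit, data.frame(x = 6))  # the fitted line can also be used to predict a new value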

If, however, you fit a multivariable regression model, then things can change dramatically, including the sign of the regression coefficient, due to confounding. A simple example can demonstrate this:

First we create a dataset consisting of 3 variables:

> X <- c(1, 2, 3, 10, 11, 12)
> Y <- c(10.1, 9.2, 7.8, 14.9, 14.1, 12.9)
> C <- c(1, 1, 1, 2, 2, 2)

> (df <- cbind(Y,X,C))

        Y  X C
[1,] 10.1  1 1
[2,]  9.2  2 1
[3,]  7.8  3 1
[4,] 14.9 10 2
[5,] 14.1 11 2
[6,] 12.9 12 2

> cor(df)

          Y         X         C
Y 1.0000000 0.8661881 0.9410920
X 0.8661881 1.0000000 0.9839347
C 0.9410920 0.9839347 1.0000000

We see that all the correlations are positive.

Next, we regress Y on X alone:

> summary(lm(Y ~ X))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   8.2733     1.1381   7.270   0.0019 **
X             0.4964     0.1432   3.467   0.0257 * 

All good (apparently) - the regression coefficient is positive and equal to the correlation between X and Y multiplied by the ratio of their standard deviations:

> cor(X,Y) * sd(Y)/sd(X)
[1] 0.4964143
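As an extra check (mine, not part of the original calculation), reversing the roles and regressing X on Y just scales the same correlation by the inverse ratio of standard deviations, so that slope is positive too:

cor(X, Y) * sd(X)/sd(Y)     # still positive
coef(lm(X ~ Y))[["Y"]]      # same value, taken directly from the reversed regression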

But now we introduce the 3rd variable, C:

> summary(lm(Y ~ X + C))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.4250     0.6491  -5.276 0.013274 *  
X            -1.0750     0.0870 -12.356 0.001142 ** 
C            14.6083     0.7958  18.357 0.000353 ***

Whoops, the coefficient for X is now negative. Why does this happen? This is an example of Simpson's paradox, where the inclusion of a 3rd variable changes the regression completely. This example is based on a long answer I wrote concerning Simpson's paradox some time ago:
Can you please explain Simpson's paradox with equations, instead of contingency tables?
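To see where the sign flip comes from in this particular toy dataset, here is one quick extra check (mine, not part of the answer above): fitting the simple regression separately within each level of C shows that Y actually falls as X rises inside each group, which is exactly what the coefficient for X picks up once C is adjusted for.

coef(lm(Y[C == 1] ~ X[C == 1]))   # slope is negative within group C = 1
coef(lm(Y[C == 2] ~ X[C == 2]))   # slope is negative within group C = 2 as well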

Robert Long
  • Hi Robert, thank you so much for the reply. Although I am a master's student in Finance and have done multiple statistics courses, I was not aware of the relation between correlation coefficients and simple regression coefficients! From what you posted, it seems that I should simply do a correlation table, to avoid any interference. – Philip Apr 14 '19 at 07:29
  • @Philip No problem. I would, however, urge caution. The point I was trying to make is that bivariate associations can often be confounded. Assuming that you can identify potential confounders, it is much better to fit a regression model with them included. – Robert Long Apr 14 '19 at 07:36
  • @Philip please also refer to this Q/A https://stats.stackexchange.com/questions/402801/bivariate-analysis-as-a-basis-for-a-subsequent-analysis/402807 – Robert Long Apr 14 '19 at 07:39