
I know that we can use a partial correlation when we want to correlate X and Y but Z affects both of them, and that we can use a semi-partial correlation when we want to correlate X and Y and we know that Z only affects X.

  • In my case, I have two different "Zs"/third variables. The idea is the following: I want to correlate X1 and X2, but I know that Z1 is strongly correlated with X1 and that Z2 is strongly correlated with X2.

So, I'm not assuming that Z1 and Z2 are correlated with both X1 and X2 (as it would be for a partial correlation, right?), and I'm not assuming that Z1 and Z2 only affect X1 or X2 (as it would be for a semi-partial).

  • To put it more simply:

    Z1 is correlated to X1
    Z2 is correlated to X2
    I want to see if X1 and X2 are correlated, controlling for the effect of Z1 on X1 and of Z2 on X2

Note: all variables are continuous

  • Edit 1: reproducible data: HERE

Any ideas would be MUCH appreciated! Thanks in advance!

  • Edit 2:

I was thinking about this solution after watching this video, and, given the kind comment below, I'm really hoping that this may be a good path:

If I apply the same logic as the video uses for the partial correlation, I guess that maybe it would be all right?

res1 <- rstandard(lm(X1 ~ Z1))
res2 <- rstandard(lm(X2 ~ Z2))

so,

res1 => the portion of X1 that is not explained by Z1
res2 => the portion of X2 that is not explained by Z2

so,

cor(res1, res2) => should be the correlation between X1 (accounting for Z1's effect on it) and X2 (accounting for Z2's effect on it), right?
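
Here is a minimal, self-contained sketch of this idea on simulated placeholder data (not my real dataset), just to make the logic concrete:

    # simulated placeholder data, only to illustrate the residual approach
    set.seed(1)
    n  <- 100
    Z1 <- rnorm(n)
    Z2 <- rnorm(n)
    common <- rnorm(n)                       # shared signal between X1 and X2
    X1 <- 2 * Z1 + common + rnorm(n, sd = 0.5)
    X2 <- 3 * Z2 + common + rnorm(n, sd = 0.5)

    # residuals of each X after removing its own Z
    res1 <- rstandard(lm(X1 ~ Z1))
    res2 <- rstandard(lm(X2 ~ Z2))

    # correlation between the "leftover" parts of X1 and X2
    cor(res1, res2)                          # Pearson
    cor(res1, res2, method = "spearman")     # rank-based (Spearman) variant
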
  • Question 1: I'm assuming this would be fine for a Pearson correlation, but my data is non-parametric. Would it be okay to apply a Spearman correlation, cor(res1, res2), instead?

  • Question 2: why rstandard? Should we always apply this, even if the data is non-parametric? (I guess so, since even Spearman's correlation uses Pearson's equation after ranking the data, right?) But still, I didn't quite understand why standardizing is a "must".

  • Edit 3: Using rstandard or not

So,

  • When I standardize the residuals, I get a Pearson correlation cor(res1, res2) of 0.95 and a slope lm(res1 ~ res2) of 0.95, which differ from the Pearson correlation from pcor.test(X1, X2, Z) of 0.96 (I know the difference in this case seems minimal, but I'd like to understand why there is a difference in the first place...)

  • When I don't standardize, the Pearson correlation cor(res1, res2) of 0.96 is equal to the Pearson correlation from pcor.test(X1, X2, Z) (0.96), but no longer to the slope lm(res1 ~ res2) (1.477e-04)... why is that? (I sketch the comparison I'm running just below.)
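
For reference, this is roughly the comparison, again on the simulated placeholder data from the sketch above (and assuming pcor.test() comes from the ppcor package):

    library(ppcor)                           # assumed source of pcor.test()

    r1_std <- rstandard(lm(X1 ~ Z1));  r2_std <- rstandard(lm(X2 ~ Z2))
    r1_raw <- residuals(lm(X1 ~ Z1));  r2_raw <- residuals(lm(X2 ~ Z2))

    cor(r1_std, r2_std)                      # correlation of standardized residuals
    cor(r1_raw, r2_raw)                      # correlation of raw residuals
    coef(lm(r1_std ~ r2_std))[2]             # slope with standardized residuals
    coef(lm(r1_raw ~ r2_raw))[2]             # slope with raw residuals
    pcor.test(X1, X2, cbind(Z1, Z2))$estimate   # partial correlation given both Zs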

  • idea: take the residuals of x1 in a linear regression on z1, and similarly the residuals of x2 in a regression analysis against z2. find the correlation between the residuals of each variable. If this is not the answer you're looking for, please explain what you need in more detail. – KishKash Sep 24 '22 at 05:51
  • @KishKash hi, thank you, I was wondering about the same solution! I'm so happy that that seems to be a good path, but I have some questions. I thought about this after watching this video (https://www.youtube.com/watch?v=Xii1jVLnX60&lc=Ugyr1Vlg5pBXqulNTa94AaABAg), and I've exchanged a few comments with the author. I'm gonna add an edit to the post, would you please check it out? – Larissa Cury Sep 24 '22 at 12:03
  • In regard to the "nonparametric" issue https://stats.stackexchange.com/q/40995/3277 – ttnphns Sep 24 '22 at 12:20
  • @ttnphns , hi, thank you for the link, so, would that be okay to check the res1, res2 normality and calculate the Spearman corr between them, then? – Larissa Cury Sep 24 '22 at 12:50
  • @LarissaCury If I understand the workings of rstandard correctly, then I believe it would be fairly safe to drop it. Since it's based on a leave-one-out tactic, for a sufficiently large number of points and extreme outliers notwithstanding, its impact on the result is negligible anyway. I'm not a statistician, though. – KishKash Sep 26 '22 at 11:57
  • @KishKash, hi, thank you, so the results are quite different when I use rstandard and when I don't use it, I'll write a follow-up edit giving the details in a moment! – Larissa Cury Sep 26 '22 at 12:09
  • re Edit 3, 2nd bullet: In principle, shifting and scaling should not have any impact on the correlation of two series of numbers; however, rstandard does not apply perfect shifting/scaling of the input series, due to it being implemented with a leave-one-out mechanism. So small differences of 0.01 are to be expected. When you apply a linear model of res1 vs res2, the slope is not expected to equal the Pearson correlation coefficient; therefore the fact that you got a different number should not be surprising. – KishKash Sep 28 '22 at 08:58
  • @KishKash, thank you once more! But why is it that the slope lm(res1 ~ res2) is equal to the Pearson (res1, res2) R when I use rstandard and differs when I don't use it? (I'm sorry if it's a silly question...) – Larissa Cury Sep 28 '22 at 15:40
  • Not terribly important for the theory underlying the question, but the example dataset has missing pairs of observations. The pattern of missingness is not aligned, so the residuals are misaligned also. You can see this with table(names(res1) == names(res2)). Also, I have trouble reproducing the high correlation estimates reported. – dipetkov Oct 18 '23 at 22:21
  • @dipetkov, hi! Thank you. I updated the data file, see if it works now. I've just run correlations between X1/Z1 and X2/Z2 and they're .98 and .95 – Larissa Cury Oct 19 '23 at 10:09
  • Perhaps I am missing something but (x1, z1) and (x2, z2) are each strongly linearly correlated. After I regress the x's on the z's, the residuals are very noisy and the correlation between res1 and res2 is very low. Can you double check? I think the approach is reasonable but it's also confusing whether you are looking at the correlations between the original variables or between the residuals. (The question seems to be about the latter.) – dipetkov Oct 19 '23 at 12:13
  • So, I'm looking, ultimately, at the relationship between X1 and X2. That's the most important thing. However, since I know that z1 and z2 have, respectively, a very strong relationship with X1 and X2, I would like to be able to say that I'm indeed looking at the relationship between X1 and X2 accounting for their relations with z1 and z2, 'controlling' for z1 and z2 respectively @dipetkov – Larissa Cury Oct 20 '23 at 14:52
  • z is the same assessment, but done in two languages (aka, 1 and 2), and we know that it has a very strong relationship with the x scores, that's why we're trying to control for it. However, z1 - x1 and z2 - x2. That's why I cannot use just a partial or semi-partial correlation, because the relation is in respect to the language. Ultimately, we want to know if x1 and x2 increase together or not – Larissa Cury Oct 20 '23 at 14:56
  • I admit I don't understand what this question is saying. To be honest, you might get a better answer if you simplify. In particular the bit about whether to use residuals() or rstandard() is, I think, orthogonal to the main question. – dipetkov Oct 20 '23 at 17:02

1 Answer


The problem

You say

I'm looking, ultimately [for] the relationship between $X_1$ and $X_2$

But since $Z_1$ and $Z_2$ are each related to only one of them, you want to address this separately. Let's first clarify what concepts we are dealing with.

Definitions

Let us first recall the definition of correlation $\rho_{X_1,X_2}$:

$$ \rho_{X_1, X_2} = \frac{\mathbb{E}\bigg[(X_1 - \mathbb{E}[X_1])(X_2 - \mathbb{E}[X_2])\bigg]}{\sigma_{X_1}\sigma_{X_2}} $$

where $\sigma_\cdot$ denotes the standard deviation of the corresponding variable.
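
As a quick numerical check of this definition, here is a sketch with made-up vectors, using the sample analogues of the expectations:

    set.seed(1)
    x1 <- rnorm(50)
    x2 <- 0.6 * x1 + rnorm(50)

    num <- sum((x1 - mean(x1)) * (x2 - mean(x2))) / (length(x1) - 1)  # sample covariance
    num / (sd(x1) * sd(x2))   # matches cor(x1, x2)
    cor(x1, x2)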

Then, the semi-partial correlation you're interested in, obtained by removing the effect of $Z_1$ on $X_1$ and of $Z_2$ on $X_2$, that is $\rho_{X_1 | Z_1, X_2 | Z_2}$, is defined as:

$$ \rho_{X_1| Z_1, X_2 |Z_2} = \frac{\mathbb{E}\bigg[(\epsilon_{X_1 \sim Z_1})(\epsilon_{X_2 \sim Z_2})\bigg]}{\sigma_{\epsilon_{X_1 \sim Z_1}}\sigma_{\epsilon_{X_2 \sim Z_2}}} $$

where $\epsilon_\cdot$ denotes the errors from a model that regresses the variables in the subscript (e.g. $\epsilon_{X_1 \sim Z_1}$ are the errors from the regression of $X_1$ on $Z_1$). The differences from the mean disappear since the errors are assumed to have mean zero: $\mathbb{E}[\epsilon_\cdot] = 0$.
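
In sample terms this is just the correlation of the two residual series, which can be checked directly. A sketch on placeholder data (not the question's dataset), with e1 and e2 playing the role of $\epsilon_{X_1 \sim Z_1}$ and $\epsilon_{X_2 \sim Z_2}$:

    # placeholder data, only to illustrate the definition
    set.seed(1)
    Z1 <- rnorm(100); X1 <- 2 * Z1 + rnorm(100)
    Z2 <- rnorm(100); X2 <- 3 * Z2 + rnorm(100)

    e1 <- residuals(lm(X1 ~ Z1))   # epsilon_{X1 ~ Z1}
    e2 <- residuals(lm(X2 ~ Z2))   # epsilon_{X2 ~ Z2}

    # sample version of the definition; residuals have mean zero by construction
    sum(e1 * e2) / (length(e1) - 1) / (sd(e1) * sd(e2))
    cor(e1, e2)                    # same value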

Lastly, let us recall that the slope of a simple linear regression of $Y$ on $X$ is:

$$ \beta_{Y \sim X} = \frac{\operatorname{Cov}[X, Y]}{\operatorname{Var}[X]} = \frac{\mathbb{E}\bigg[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\bigg]}{\sigma_X \sigma_X} $$

We can see that the slope coefficient for the regression of the errors of $X_1$ on the errors of $X_2$, that is $\beta_{\epsilon_{X_1 \sim Z_1} \sim \epsilon_{X_2 \sim Z_2}}$, will be equal to our semi-partial correlation of interest only if $\sigma^2_{\epsilon_{X_2 \sim Z_2}} = \sigma_{\epsilon_{X_1 \sim Z_1}}\sigma_{\epsilon_{X_2 \sim Z_2}}$, that is, only if the two error series have the same standard deviation. With this, we can now answer your questions.
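
A small illustration of this point, continuing the placeholder residuals e1 and e2 from the sketch above: rescaling one residual series leaves the correlation unchanged but changes the slope.

    # slope of lm(e1 ~ e2) is Cov(e1, e2) / Var(e2)
    coef(lm(e1 ~ e2))[2] * sd(e2) / sd(e1)   # rescaled slope recovers the correlation
    cor(e1, e2)

    e2_big <- 10 * e2                        # inflate the predictor's spread
    cor(e1, e2_big)                          # correlation is unchanged
    coef(lm(e1 ~ e2_big))[2]                 # slope shrinks by a factor of 10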

Answers

Answer 1

You say that your "data is non-parametric", which, although applied researchers often describe it this way, does not quite make sense: a model can be parametric or non-parametric, but data by itself has no parameters. Relatedly, there is the common misconception that the data must be normally distributed, when that assumption (where it applies) concerns the errors of a model. Whether Spearman's correlation is "okay" depends on your goal, in particular on whether association as measured by ranks (which is what Spearman's correlation captures) is what matters to you.

Answer 2

You don't necessarily need to standardize the residuals: as the definition above shows, a correlation is a standardized covariance, so computing a correlation already divides by the standard deviations anyway.
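
A quick way to see this, on the placeholder residuals from the sketch above: shifting and scaling the residuals by hand leaves the correlation untouched.

    cor(e1, e2)                    # raw residuals
    cor(scale(e1), scale(e2))      # standardized by hand: identical value

Note, as mentioned in the comments, that rstandard() scales each residual by its own leverage rather than applying one common scale to the whole series, which is why its correlation can differ slightly from the raw-residual one.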

Follow-up answers

You then ask why you get different answers from pcor.test(X1, X2, Z) than from the previous methods$^1$. Since you do not share your code, I will assume that $Z$ contains both $Z_1$ and $Z_2$. Then, unless $Z_1$ is uncorrelated with $X_2$ and $Z_2$ is uncorrelated with $X_1$ (in which case a partial correlation on $Z_1$ and $Z_2$ would have been simpler), it is natural to expect the results to differ: $\rho_{X_1 | Z_1, X_2 | Z_2} \neq \rho_{X_1 | \{Z_1, Z_2\}, X_2 | \{Z_1, Z_2 \}}$.
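
To make the distinction concrete, here is a sketch on the placeholder data from above (with pcor.test() assumed to come from the ppcor package) comparing the two quantities:

    library(ppcor)

    pcor.test(X1, X2, cbind(Z1, Z2))$estimate   # rho_{X1 | {Z1, Z2}, X2 | {Z1, Z2}}
    cor(residuals(lm(X1 ~ Z1)),
        residuals(lm(X2 ~ Z2)))                 # rho_{X1 | Z1, X2 | Z2}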

Then, you ask why the result differs between the slope and the correlation when you do not standardize. Note from the last equation in the definitions section that the slope coefficient is the covariance divided by the variance of the predictor (its standard deviation squared). For the slope coefficient to equal the correlation, the covariance needs to be divided by the product of the standard deviations of both variables instead. This is exactly what happens when you standardize, since the standard deviations (and variances) of the standardized residuals become one, $\sigma_{\epsilon^{\text{std}}_{X_1 | Z_1}} = \sigma_{\epsilon^{\text{std}}_{X_2 | Z_2}} = 1$, so the variance of the predictor and the product of the two standard deviations are both equal to one.
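
A short sketch of this last point, on the placeholder residuals from above: once both series are rescaled to unit standard deviation, the slope and the correlation coincide.

    s1 <- as.vector(scale(e1))     # standardized residuals, sd = 1
    s2 <- as.vector(scale(e2))

    coef(lm(s1 ~ s2))[2]           # slope of the regression
    cor(s1, s2)                    # same value as the slope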


$^1$ This minimal difference could also just be approximation error from how different functions compute the same quantity.

Kuku