
Suppose I have two dependent variables X and Y that depend on an independent variable t. I would like to model corr(X,Y) ~ t so that I can detect whether the correlation between X and Y changes over time. Specifically, I want the correlation to be conditioned on t and to vary linearly (or, more generally, smoothly) with t, but the measurements may all be at distinct t values, so I can't just compute corr(X,Y) at each value of t.

Is there any literature on this? I imagine it's well-studied but I don't know the right keywords and searching for the obvious terms only gives me mountains of elementary material on regression and correlation.

I have some ideas of my own for solving this, but I don't want to overlook existing literature. My simplest idea is to regress something like X ~ Y * t and look for an interaction between Y and t. That's not quite what I'm looking for, though: the fitted values aren't correlations, and Y has measurement error too. Another is to regress XY ~ t, which has similar shortcomings. I'd be more satisfied with something like a regression of (X,Y) = epsilon, where epsilon is drawn independently from N(0, Sigma(t)) and Sigma(t) is a correlation matrix parameterized by t.
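To make that last idea concrete, here is a minimal sketch, not an established method. It assumes X and Y have already been centered and scaled to zero mean and unit variance, and it assumes a Fisher-z style link rho(t) = tanh(a + b*t) so that the correlation stays inside (-1, 1). The link, the function names `simulate` and `fit_linear_fisher_z`, and the synthetic data are all illustrative assumptions. The fit maximizes the bivariate normal likelihood by Fisher scoring; a slope b that is large relative to its standard error would suggest that the correlation changes with t.

```python
import math
import random


def simulate(n, a, b, seed=0):
    """Draw (t, x, y) with corr(X, Y | t) = tanh(a + b*t), zero means, unit variances.

    Purely synthetic data for illustration; t is uniform on [0, 1].
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t = rng.random()
        rho = math.tanh(a + b * t)
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        data.append((t, e1, rho * e1 + math.sqrt(1 - rho * rho) * e2))
    return data


def fit_linear_fisher_z(data, n_iter=25):
    """Fisher scoring for rho(t) = tanh(a + b*t) (assumed link, hypothetical helper).

    Returns (a_hat, b_hat, se_b), where se_b is the asymptotic standard error
    of b taken from the expected information at the last iterate.
    """
    a = b = 0.0
    for _ in range(n_iter):
        g0 = g1 = i00 = i01 = i11 = 0.0
        for t, x, y in data:
            rho = math.tanh(a + b * t)
            q = x * x - 2 * rho * x * y + y * y
            # Derivative of the log-likelihood with respect to z = a + b*t.
            score = rho + x * y - rho * q / (1 - rho * rho)
            # Expected information with respect to z.
            info = 1 + rho * rho
            g0 += score
            g1 += score * t
            i00 += info
            i01 += info * t
            i11 += info * t * t
        det = i00 * i11 - i01 * i01
        # Solve the 2x2 system (information matrix) * step = score by hand.
        a += (i11 * g0 - i01 * g1) / det
        b += (i00 * g1 - i01 * g0) / det
    se_b = math.sqrt(i00 / det)
    return a, b, se_b


data = simulate(n=2000, a=-0.5, b=1.0)
a_hat, b_hat, se_b = fit_linear_fisher_z(data)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}, Wald z for b = {b_hat / se_b:.2f}")
```

The same expected information could be used for a rough sample-size calculation, since under these assumptions the standard error of b shrinks like 1/sqrt(n) and depends on the spread of the t values.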

user32157
  • I wonder how you can regress $X \sim Y * t$ if X and Y are taken at different times t. Are the X, Y data longitudinal, and do you have a model for the individual dependency of X and Y on t? – Ute Jul 11 '23 at 17:09
  • I was unclear. Each measurement is like (x,y) at some time t but each value of t may be distinct at different (x,y) points. It is longitudinal, though I'm guessing the theory is applicable in general contexts. I would also like to include dependence of mean values of X and Y on t but left that out for simplicity here as it seems like a second layer of complexity. – user32157 Jul 11 '23 at 17:36
  • Good that X and Y do not come at different times! In order to calculate the correlation, you could use a prediction for the means of X and Y as a function of t; that is why I was asking about the model. X and Y are continuous numeric variables, right? Your idea XY ~ t seems not totally off at first glance. – Ute Jul 11 '23 at 18:16
  • There is literature. As I recall, the documentation for ECP provides useful background. This R package implements a changepoint detection method that, among other things, can find changes in the covariance of a vector process. It is based on an (effectively) online estimate of the multivariate distribution of $(X_t,Y_t).$ – whuber Jul 11 '23 at 18:44
  • X, Y, t are all continuous numeric, yes. – user32157 Jul 11 '23 at 19:26
  • Thanks for the pointer to ECP. Change-point analysis isn't exactly what I was hoping for, as I expect a continuous change in these correlations and was hoping to be able to literally plot corr(X,Y) as a function of time. Nonetheless, it's a good idea to get up to speed with that option in case it's the best I've got. – user32157 Jul 11 '23 at 19:29
  • Do you have good models for X ~ t and Y ~ t separately, just for the mean? And some model for how the variances of X and Y behave as a function of t (not necessarily homoscedasticity)? Variances and covariances are less "easy" to estimate, so a strong model for Sigma(t) could help. How many data points do you have? – Ute Jul 11 '23 at 20:59
  • Let's assume that I do have those marginal models down; I'd even be interested in theory that only applied when they are known to have constant mean zero and are homoskedastic. Part of the goal of this is to determine how many data points I need, but it's safe to say ~1000 are available. – user32157 Jul 11 '23 at 23:11
  • I've googled a bit and could not find anything that fits your problem - what is the scientific context? Maybe someone has tried to tackle the problem in a domain journal rather than in a statistical paper - wonder if there really isn't a solution out there. Otherwise it would rather be a paper than a short (or longer) forum post. Could be an interesting project, with the backing of trusted models... – Ute Jul 12 '23 at 11:18
  • Slightly newer related question: https://stats.stackexchange.com/q/621145/237561 – Ute Jul 12 '23 at 13:06
  • That is a very related question, good find. I'm doing this in bioinformatics, interested in gene co-expression. I was expecting to steal methodology from economics or general stats though. – user32157 Jul 12 '23 at 13:18
  • The only bioinformatics-specific prior art I know of is Liquid Association: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-014-0371-5 – user32157 Jul 12 '23 at 13:40
  • Although ECP might not directly solve your problem, its theoretical underpinning consists of a way to estimate a full conditional multivariate distribution of a vector response -- and that sounds exactly like what you need. – whuber Jul 12 '23 at 13:48
  • Could you elaborate in an answer? I'm not seeing what you're suggesting. It looks like ECP is built on a divergence metric to test for differences in multivariate distributions but doesn't estimate the distributions themselves. I also don't see anything in ECP about conditional multivariate distributions, except in the trivial case of conditioning on a discrete variable. – user32157 Jul 12 '23 at 15:13
  • Good idea to steal from economics :-) The suggestion of OP in the very related question (I almost suspected you to be the same person...) to bin the data is not too bad. They might want to use a kernel instead of bins. – Ute Jul 12 '23 at 17:18
  • The Liquid Association paper cites another paper by some of the authors (number 17). I used Google scholar on it to see who is citing reference [17], and found the recent paper (also with Ho as author) https://onlinelibrary.wiley.com/doi/full/10.1111/biom.13701 - this might be interesting. The abstract says something about "modelling covariate-dependent correlation structures". || Google scholar search: https://scholar.google.com/scholar?cites=7714915715696074284&as_sdt=2005&sciodt=0,5&hl=en – Ute Jul 12 '23 at 17:38
  • What is a "conditional multivariate distribution"?? Regardless, once you have an effective metric to compare distributions, you are well on your way to analyzing how the conditional response in a regression varies with the explanatory variables. It will be hard to suggest anything more specific than that until you disclose more specific information about your application. – whuber Jul 13 '23 at 14:53
  • "Conditional multivariate distribution" was taken verbatim from your comment, but I took it to mean the joint distribution of X and Y conditional on t, which is what I'm interested in estimating as function of t. I'm still not sure how to make use of a metric on distributions when I haven't yet been able to estimate the distribution. Which two distributions should I apply the metric to? – user32157 Jul 13 '23 at 19:43
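The suggestion in the comments to replace bins with a kernel can be sketched as follows. This is only an illustration under stated assumptions: a Gaussian kernel, a hand-picked bandwidth, and synthetic data whose true correlation drifts linearly from -0.8 to 0.8. The names `simulate_drifting` and `local_correlation` are hypothetical. The estimate at each t0 is a weighted Pearson correlation in which observations near t0 count more; because the weighted means are subtracted locally, a slowly drifting mean is partly absorbed as well.

```python
import math
import random


def simulate_drifting(n, seed=0):
    """Synthetic (t, x, y) with corr(X, Y | t) = -0.8 + 1.6*t for t uniform on [0, 1]."""
    rng = random.Random(seed)
    t, x, y = [], [], []
    for _ in range(n):
        ti = rng.random()
        rho = -0.8 + 1.6 * ti
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        t.append(ti)
        x.append(e1)
        y.append(rho * e1 + math.sqrt(1 - rho * rho) * e2)
    return t, x, y


def local_correlation(t, x, y, t0, bandwidth):
    """Gaussian-kernel-weighted Pearson correlation of (x, y) around t0."""
    w = [math.exp(-0.5 * ((ti - t0) / bandwidth) ** 2) for ti in t]
    sw = sum(w)
    # Kernel-weighted local means.
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Kernel-weighted sums of squares and cross-products.
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    return sxy / math.sqrt(sxx * syy)


t, x, y = simulate_drifting(1000)
for t0 in (0.1, 0.3, 0.5, 0.7, 0.9):
    r = local_correlation(t, x, y, t0, bandwidth=0.1)
    print(f"t0 = {t0:.1f}: local corr = {r:.2f}")
```

The bandwidth trades bias against variance: a wider kernel gives a smoother but more biased curve, and estimates near the ends of the t range are pulled toward interior values because the kernel is one-sided there.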
