Very belated follow-up to a previous question:

I have some fairly simple linear models predicting a rate (a continuous response variable) from certain features of the distribution of some measured value. The distribution features are the predictor/explanatory variables. The theory is that a combination of features of the distribution on each day (e.g. mean, s.d., skewness, possibly quantiles) will relate to the relevant rate / direction on that day, so e.g. rate ~ mean_x + skew_x.

The measured predictor values are from a sample taken every day, but with different (and quite widely varying) sample sizes on different days, from dozens to thousands (total sample size of ~300k, average ~400/day).

Conceptually, I understand that if there is measurement error in both the predictor and response variables, then some kind of errors-in-variables (EIV) regression may be in order. But that's about the extent of my understanding. From what I've seen, it looks like you normally just feed the measurement error for each predictor (and possibly the response), one value per variable, into the regression.

But I don't have a known or constant measurement error in the predictor variables; all I have is the sample size for each day (which would relate to the standard error in the measured mean, skewness, etc.). Some datapoints are thus expected to have much smaller or much larger measurement errors. Similarly, I don't have info on measurement error in the response variable, only a sample size for each day as well.
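To make the link between sample size and measurement error concrete, here is the standard sampling result for the daily mean, written in notation that is assumed here rather than taken from the data: $n_i$ is the sample size on day $i$, $s_i$ the within-day sample standard deviation, and $\bar{x}_i$ the observed daily mean. The skewness line is only a rough guide, since it holds for approximately normal data and large $n_i$:

$$
\operatorname{SE}(\bar{x}_i) = \frac{s_i}{\sqrt{n_i}}, \qquad \operatorname{Var}(\bar{x}_i) \approx \frac{s_i^2}{n_i}
$$

$$
\operatorname{SE}(\text{skew}_i) \approx \sqrt{\frac{6}{n_i}} \quad \text{(normal data, large } n_i\text{)}
$$

So $1/n_i$ alone gives the error variance of the mean only up to the factor $s_i^2$, which may itself differ from day to day.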

Is it feasible to use the info I have on sample sizes to inform an EIV regression? And how? (I'm working in R.)

TY Lim
  • From "some datapoints are thus expected to have much smaller or much larger measurement errors" I conclude you anticipate heteroskedasticity in the measurement error variance, right? – Durden Feb 28 '24 at 04:29
  • @Durden I'm not sure, actually. Because the difference in predictor errors doesn't necessarily vary systematically with the response. There will be some datapoints where the predictors have low sample size / high error where the response has a high value, and some where the response has a low value, and vice-versa. Not sure if that counts as true heteroskedasticity? – TY Lim Feb 28 '24 at 14:33
  • Without going into too much detail, an EIV model requires you to specify the variance of the distribution of the true covariate $x^\ast$. This variance could be identical for every measured $x$, or it could vary depending on $x$ itself, time of sampling (which I think is what you mentioned), or anything else. It makes the model more complex and will require extraneous information (unless you have repeated measurements), but it is doable (in this JAGS example it would mean varying taux with each i, similar to how the mean truex[i] already does). – Durden Feb 28 '24 at 15:49
  • The variance would vary depending on sample size for each time/data point (which isn't systematically varying over time, but is known). The measurements for each time/data point aren't repeated.

    Since the true covariate here is a population mean, and the measured quantity is a sample mean, would it be possible to use (inverse of) sample size as a proxy for variance?

    (Note that for theoretical reasons the underlying distribution of values from which we're observing the mean is expected to change for different data points)

    – TY Lim Feb 28 '24 at 17:53
  • (I'm wondering if it could be similar to a weighted regression where you weight observations of the response variable based on sample sizes, similar to this: https://stats.stackexchange.com/questions/504572/including-a-weighting-variable-in-a-linear-regression - and in fact could you weight both predictors and response vars by their respective sample sizes?) – TY Lim Feb 28 '24 at 17:58
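One way to see how per-day sample sizes could feed an EIV-style correction is a small simulation. The sketch below is only an illustration under stated assumptions, not a recommended analysis: it uses a single noisy predictor (the daily mean), treats the within-day standard deviation as known and constant (`within_day_sd`, a made-up value), and applies a simple method-of-moments correction that subtracts the average measurement-error variance from the observed predictor variance. The function name `eiv_slope` and all simulated quantities are hypothetical. It is written in Python for convenience; the arithmetic carries over directly to R.

```python
import math
import random


def eiv_slope(x_obs, y, me_var):
    """Return (naive, corrected) slopes for y ~ x with one noisy predictor.

    x_obs[i] is the true covariate plus an error whose variance me_var[i]
    is assumed known (here: within-day variance divided by sample size).
    """
    n = len(x_obs)
    x_bar = sum(x_obs) / n
    y_bar = sum(y) / n
    sxx = sum((x - x_bar) ** 2 for x in x_obs) / (n - 1)
    sxy = sum((x - x_bar) * (v - y_bar) for x, v in zip(x_obs, y)) / (n - 1)
    naive = sxy / sxx  # ordinary least-squares slope, attenuated toward 0
    # Observed variance = true-covariate variance + average error variance,
    # so subtract the average error variance before dividing.
    corrected = sxy / (sxx - sum(me_var) / n)
    return naive, corrected


random.seed(1)
true_slope = 2.0
within_day_sd = 5.0  # assumption: spread of raw values within a day

x_obs, y, me_var = [], [], []
for _ in range(5000):  # simulated days
    n_i = int(10 ** random.uniform(1.3, 3.5))  # dozens to thousands per day
    mu_i = random.gauss(0.0, 1.0)              # true daily mean
    se_i = within_day_sd / math.sqrt(n_i)      # standard error of the mean
    x_obs.append(random.gauss(mu_i, se_i))     # observed daily mean
    y.append(1.0 + true_slope * mu_i + random.gauss(0.0, 0.5))
    me_var.append(se_i ** 2)

naive, corrected = eiv_slope(x_obs, y, me_var)
print(f"naive slope: {naive:.3f}, corrected slope: {corrected:.3f}")
```

In this setup the naive slope should come out noticeably below the true value of 2, while the corrected one should land close to it. In real data the error variances would have to be estimated (e.g. from each day's sample variance and sample size), and skewness or quantile predictors would each need their own error variance. Error in the response alone does not bias the slope, which is why sample-size weights are the usual tool on that side.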

0 Answers