[EDIT: I had a fundamental misunderstanding and have rephrased the question - thanks to @whuber for catching that]
I have some fairly simple regressions predicting a rate (continuous response, via linear regression) and a direction (binary response, via logistic regression) from features of the distribution of some measured value. These distribution features are the predictor/explanatory variables.
The measured values are from a sample taken every day, but with different (and quite widely varying) sample sizes on different days, from dozens to thousands (total sample size of ~300k, average ~400/day). The theory is that a combination of features of the distribution each day (e.g. mean, s.d., skew, possibly quantiles) will relate to the relevant rate / direction on that day, so e.g.
rate ~ mean_x + skew_x.
Note that the rate / direction are not calculated from the same sample as the explanatory variables - they are observed separately on each day.
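For concreteness, here is roughly what the setup looks like in R; the data frames, column names, and the use of the moments package for skewness are just placeholders for illustration, not my actual data or code:

    # daily_obs: one row per measured value, with the day it was observed on
    # daily_outcomes: one row per day, with the separately observed rate and direction
    library(moments)  # one option for a skewness estimator

    feats <- aggregate(x ~ day, data = daily_obs,
                       FUN = function(v) c(mean = mean(v),
                                           sd   = sd(v),
                                           skew = skewness(v),
                                           n    = length(v)))
    feats <- do.call(data.frame, feats)   # flatten aggregate()'s matrix column
    names(feats) <- c("day", "mean_x", "sd_x", "skew_x", "n_day")

    dat <- merge(feats, daily_outcomes, by = "day")

    fit_rate <- lm(rate ~ mean_x + skew_x, data = dat)    # linear regression for the rate
    fit_dir  <- glm(direction ~ mean_x + skew_x,
                    family = binomial, data = dat)        # logistic regression for the direction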
Given how widely the daily sample sizes vary, it seems there should be some way to weight the datapoints based on the sample size behind that day's explanatory variables.
I thought weighted regression based on sample sizes (similar to this previous question and this post) was appropriate, but that approach weights by the sample size underlying the response variable (e.g. each response value being a mean of n_i observations), not the explanatory variables, which is the reverse of what I have.
My revised question is thus: Is it appropriate to include some weighting of the datapoints to reflect the differing sample sizes from which each datapoint's explanatory variables are calculated? And if so, how?
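To make that concrete, mechanically I mean something like the call below (continuing the placeholder names from the sketch above; I'm not claiming this is statistically justified, that is exactly what I'm asking):

    # Weight each day's row by the sample size behind that day's predictors.
    # This shows only what the call would look like, not that it is valid.
    fit_rate_w <- lm(rate ~ mean_x + skew_x, data = dat, weights = n_day)

    # For the logistic model the analogue is less obvious, since glm() with
    # family = binomial interprets `weights` as numbers of trials for the response.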
- 1) Intuitively, when drawing a sample from a population, the distribution of the sample approximates the distribution of the population. 2) The larger the sample, the smaller the variance of the sample mean, and the more closely the sample mean approximates the population mean. 3) Datapoints with larger sample sizes should thus carry more weight because they're more reliable. 4) But the relationship between sample size and how well other sample distribution moments approximate the population distribution moments is more complicated? – TY Lim May 08 '23 at 20:31
- Suppose your y1 is the mean of 1000 observations, your y2 is the mean of 600 observations, and your y3 is the mean of 400 observations. You would include it like this: lm(y ~ x, weights = c(1000, 600, 400)) – TY Lim May 08 '23 at 20:33
- What about when several quantiles are used as predictors (e.g. rate ~ q05_x + median_x + q95_x)? Because the sample variance of the predictor variables would then differ across dimensions of the same datapoint based on not just the sample size but the specific quantile in question. So what would be the appropriate way to weight the datapoints for the regression? – TY Lim May 08 '23 at 22:02
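To illustrate points 2) and 4) above, here is a rough simulation sketch in R (the lognormal population and the sample sizes are arbitrary choices):

    # Empirical spread of sample estimates of the mean and of the skewness as the
    # sample size grows, using a lognormal population as an arbitrary skewed example.
    library(moments)  # skewness()
    set.seed(1)

    est_sd <- function(n, reps = 2000) {
      ests <- replicate(reps, {
        s <- rlnorm(n)
        c(mean = mean(s), skew = skewness(s))
      })
      apply(ests, 1, sd)   # empirical standard deviation of each estimator
    }

    ns  <- c(30, 100, 300, 1000, 3000)
    out <- sapply(ns, est_sd)
    colnames(out) <- ns
    out

The spread of the sample mean shrinks steadily with n, while the skewness estimate tends to stay much noisier at the same n.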