
[EDIT: I had a fundamental misunderstanding and have rephrased the question - thanks to @whuber for catching that]

I have two pretty simple regressions: a linear regression predicting a rate (continuous response variable) and a logistic regression predicting a direction (binary response variable), both from certain features of the distribution of some measured value. Those distribution features are the predictor/explanatory variables.

The measured values are from a sample taken every day, but with different (and quite widely varying) sample sizes on different days, from dozens to thousands (total sample size of ~300k, average ~400/day). The theory is that a combination of features of the distribution each day (e.g. mean, s.d., skew, possibly quantiles) will relate to the relevant rate / direction on that day, so e.g. rate ~ mean_x + skew_x.
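
For concreteness, here is a minimal R sketch of that setup (all names and values are illustrative placeholders - simulated data standing in for the real daily summaries):

    # Illustrative sketch only: one row per day, with distribution features
    # of x computed from that day's sample, plus the separately observed
    # rate and direction. All values here are simulated placeholders.
    set.seed(1)
    n_days <- 100
    daily <- data.frame(
      mean_x = rnorm(n_days),                           # daily sample mean of x
      skew_x = rnorm(n_days, sd = 0.5),                 # daily sample skewness of x
      n_x    = sample(30:3000, n_days, replace = TRUE)  # daily sample size
    )
    daily$rate      <- rnorm(n_days)           # continuous response, observed separately
    daily$direction <- rbinom(n_days, 1, 0.5)  # binary response, observed separately

    fit_rate <- lm(rate ~ mean_x + skew_x, data = daily)                           # linear
    fit_dir  <- glm(direction ~ mean_x + skew_x, data = daily, family = binomial)  # logistic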

Note that the rate / direction are not calculated from the same sample as the explanatory variables - they are observed separately on each day.

Given the wide range of sample sizes across days, it seems there should be some way to weight the data points based on the sample size underlying each day's explanatory variables.

I thought weighted regression based on sample sizes (similar to this previous question and this post) was appropriate, but that approach weights by the sample size behind the response variable, not the explanatory variables - the reverse of my situation.

My revised question is thus: is it appropriate to include some weighting of data points to reflect the differing sample sizes from which each data point's explanatory variables are calculated? And if so, how?

TY Lim
  • Could you explain how you are applying logistic regression, which models a 0/1 response, to predict "rate & direction"? What kinds of values are these variables? BTW, the underlying principle is to weight a response variable by the reciprocal of its variance: it merely happens that the variance of a mean of iid values is inversely proportional to sample size. – whuber May 08 '23 at 16:42
  • Apologies for the ambiguity - they are separate regressions for predicting rate (continuous variable, linear regression) and direction (binary variable, logistic regression). – TY Lim May 08 '23 at 16:54
  • Must you use moment-based estimates of skewness (and other properties) or are you open to using other estimates, such as those based on order statistics? The latter might yield more stable and tractable solutions. The basic problem with higher moments is their inherent instability even with very large samples. – whuber May 08 '23 at 16:56
  • It could potentially be done using order statistics rather than moments, yes. But I'm not sure how that would affect the question of weighting? – TY Lim May 08 '23 at 16:58
  • The variances of the order statistics (at fixed quantiles) enjoy the same asymptotic behavior with sample size as the mean, whereas that's not the case for raw moments. For the standardized moments the situation is complicated (because you're combining correlated statistics) and therefore might require substantial and difficult analysis to determine the proper weights, depending on what your estimator might be. – whuber May 08 '23 at 17:10
  • Thanks. To echo what you're saying as a check of my own understanding:
    1. Intuitively, when drawing a sample from a population, the distribution of the sample approximates the distribution of the population.
    2. The larger the sample, the smaller the sampling variance of the sample mean, and the more closely it approximates the population mean.
    3. Data points with larger sample sizes should thus carry more weight because they're more reliable.
    4. But the relationship between sample size and how well other sample distribution moments approximate the population moments is more complicated?
    – TY Lim May 08 '23 at 20:31
  • Yes, that's all basically correct. – whuber May 08 '23 at 20:32
  • Whereas with larger sample size / smaller sampling variance, fixed-quantile order statistics for the sample do yield closer estimates of the population quantiles? So would the weighting (if using quantiles) be the same as for the mean, per the answer to the other question I linked:

    "Suppose your y1 is the mean of 1000 observations, your y2 is the mean of 600 observations, y3 is the mean of 400 observations. You would include it like this: lm(y ~ x, weights = c(1000, 600, 400))"

    – TY Lim May 08 '23 at 20:33
  • See https://stats.stackexchange.com/questions/45124 concerning the sampling variance of quantiles. It shows a dependence on three quantities: the underlying density at the quantile; the quantile itself; and the sample size, where the dependence is the same as for the sample mean. – whuber May 08 '23 at 20:38
  • Hmmmm. So if I'm understanding that post, the variance of sample quantiles will differ in part based on the quantile itself, right? In that case, wouldn't that pose a problem if I were using a model that included multiple quantiles (e.g. rate ~ q05_x + median_x + q95_x)? The sampling variance of the predictor variables would then differ across dimensions of the same datapoint based not just on the sample size but on the specific quantile in question. So what would be the appropriate way to weight the datapoints for the regression? – TY Lim May 08 '23 at 22:02
  • You don't weight datapoints according to the variances of the explanatory variables: you weight them according to the response variable only. When the former variances are appreciable relative to the ranges of the explanatory variables in your data, you are in trouble and need an errors-in-variables model to cope with the bias that is introduced. – whuber May 08 '23 at 22:08
  • Oh dang you're right, I totally misunderstood / misread the linked question. Thanks for catching that. In that case though, my underlying question still stands (and is even more basic now) - is there some appropriate way to weight the different datapoints based on sample sizes for the explanatory vars? Have rephrased original post to reflect that. – TY Lim May 09 '23 at 14:06
  • My answer is embodied in my preceding comment: be very careful, because you might be in an errors-in-variables situation. – whuber May 09 '23 at 17:00
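
[Note: the asymptotic result cited in the comments above for sample quantiles is, for the sample $q$-quantile $\hat{x}_q$ from $n$ iid draws with density $f$ and true quantile $x_q$, $\operatorname{Var}(\hat{x}_q) \approx \frac{q(1-q)}{n\,f(x_q)^2}$ - the same $1/n$ dependence as for the sample mean, with the quantile $q$ and the density at the quantile supplying the other two factors.]

[Note: to illustrate the errors-in-variables warning above, here is a minimal R simulation with purely hypothetical numbers (none of it from the actual data): when an explanatory variable is itself a noisy sample estimate, its estimated slope is biased toward zero.]

    # Hypothetical simulation of the errors-in-variables issue discussed above.
    set.seed(1)
    n_days    <- 1000
    true_mean <- rnorm(n_days)                  # true daily feature
    rate      <- 2 * true_mean + rnorm(n_days)  # true slope is 2

    # Each day's feature is estimated from a finite sample; smaller samples
    # give noisier estimates (the sd of a sample mean is sigma / sqrt(n)).
    n_x      <- sample(20:200, n_days, replace = TRUE)
    obs_mean <- true_mean + rnorm(n_days, sd = 5 / sqrt(n_x))

    coef(lm(rate ~ true_mean))  # recovers a slope near 2
    coef(lm(rate ~ obs_mean))   # slope noticeably attenuated toward 0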

0 Answers