
I want the weighted mean of my dv, velocity.

In this scenario, velocity is a derived/interpolated measure composed of repeated measures of randomly sampled speeds in a given region. There will always be a fixed number of velocity measures derived from a smaller number of speed measurements (in the real data, though not in this sample data).

In the sample data, there are up to 10 dv measurements for each unique combination of id, timepoint, and direction. The <=10 dv measures can be derived from at least 3 speed measures.

As such, dv's that were calculated where there was more speed sampling are more accurate. This method is standard and has been validated in my field.

Sample data:

dat <- structure(list(n_speed = c(7, 6, 5, 4, 7, 6, 4, 9), id = c("subj_1", 
"subj_1", "subj_1", "subj_1", "subj_2", "subj_2", "subj_2", "subj_2"
), timepoint = c("t1", "t1", "t2", "t2", "t1", "t1", "t2", "t2"
), direction = c("long", "lat", "long", "lat", "long", "lat", 
"long", "lat")), class = "data.frame", row.names = c(NA, -8L))

> head(dat)
         dv     id intervention timepoint direction  region
1 0.7708878 subj_1         ctrl        t1      long healthy
2 0.9193373 subj_1         ctrl        t1      long healthy
3 1.0000385 subj_1         ctrl        t1      long healthy
4 0.6570246 subj_1         ctrl        t1      long healthy
5 0.9345068 subj_1         ctrl        t1      long healthy
6 0.9421999 subj_1         ctrl        t1      long healthy

and

speed_measures <- structure(list(n_speed = c(7, 6, 5, 4, 7, 6, 4, 9), id = c("subj_1", 
"subj_1", "subj_1", "subj_1", "subj_2", "subj_2", "subj_2", "subj_2"
), timepoint = c("t1", "t1", "t2", "t2", "t1", "t1", "t2", "t2"
), direction = c("long", "lat", "long", "lat", "long", "lat", 
"long", "lat")), class = "data.frame", row.names = c(NA, -8L))

> speed_measures
  n_speed     id timepoint direction
1       7 subj_1        t1      long
2       6 subj_1        t1       lat
3       5 subj_1        t2      long
4       4 subj_1        t2       lat
5       7 subj_2        t1      long
6       6 subj_2        t1       lat
7       4 subj_2        t2      long
8       9 subj_2        t2       lat

Here we can see that for subj_1 x t2 x lat, we derived all the dv values at the unique level of id, timepoint, and direction using 4 speed measures. Conversely, for subj_2 x t2 x lat, we derived dv using 9 speed measures. When we ultimately calculate the estimated marginal means of t2 x lat, subj_2 should have greater influence on the mean than subj_1.

So we can join these tables to see the number of speed measurements that went into deriving the dv measurements for each unique id x timepoint x direction combination. region and intervention are additional factors, but each speed measurement and derived velocity occurred at the unique id x timepoint x direction levels.

We will only take 70 of the 80 measures to simulate imbalance in the actual data.

dat_combined <- speed_measures |>
  left_join(dat, by = c("id", "timepoint", "direction")) |>
  slice_sample(n = 70)

> head(dat_combined)
  n_speed     id timepoint direction        dv intervention  region
1       7 subj_1        t1      long 0.7708878         ctrl healthy
2       7 subj_1        t1      long 0.9193373         ctrl healthy
3       7 subj_1        t1      long 1.0000385         ctrl healthy
4       7 subj_1        t1      long 0.6570246         ctrl healthy
5       7 subj_1        t1      long 0.9345068         ctrl healthy
6       7 subj_1        t1      long 0.9421999         ctrl healthy

The ultimate goal is to determine velocity changes at the 3 way interaction of timepoint, intervention, and region, controlling for/averaging over direction with emmeans.

I built the following mixed effects model using R lme4:

dv ~ intervention * timepoint * region + direction + (1|id)

This was my initial attempt to derive a semblance of weights:

dat_weighted <- dat_combined |> 
  group_by(id, timepoint, direction) |>
  mutate(length_dv = length(dv)) |>
  mutate(num_speed_per_velocity = n_speed/length_dv)

> head(dat_weighted)
# A tibble: 6 × 9
# Groups:   id, timepoint, direction [5]
  n_speed id     timepoint direction    dv intervention region  length_dv num_speed_per_velocity
    <dbl> <chr>  <chr>     <chr>     <dbl> <chr>        <chr>       <int>                  <dbl>
1       4 subj_2 t2        long      0.627 trt          damaged         9                  0.444
2       6 subj_1 t1        lat       0.508 ctrl         damaged         9                  0.667
3       6 subj_1 t1        lat       0.703 ctrl         healthy         9                  0.667
4       5 subj_1 t2        long      0.748 ctrl         healthy         8                  0.625
5       6 subj_2 t1        lat       0.326 trt          damaged         9                  0.667
6       9 subj_2 t2        lat       0.589 trt          damaged         8                  1.12

Looking forward to your input!

myfatson

1 Answer


The lmer function in the lme4 package allows you to specify an argument weights that gives weights used for fitting the model. From your description, it sounds like you want to give higher weight to observations which have a higher value for the n_speed variable, though you have not specified exactly how this variable should affect the weights for the observations, or how your DV measurement is derived from the different measurements taken.

Depending on your particular theory and approach, one reasonable way to set the weights in a regression analysis is to use inverse-variance weighting based on your view of the likely effect of the number of speed measurements on the error variance. This is complicated, since the effect of the number of measurements may be mediated through the relationship of the explanatory variables to the response. Nevertheless, some crude methods can be used to start you off.

Suppose that we are willing to treat the measurements as if they were independent measurements having errors with zero mean and fixed variance. Since you have not specified to the contrary, suppose furthermore that we treat your DV as an average of the different speed measurements. In this case the variance of the measurement error for the DV would be proportionate to $1/n_S$ where $n_S$ is the number of speed measurements in the analysis. Consequently, using inverse-variance weighting would mean that we would set the weights proportionate to $n_S$. To do this, you would fit your model with a command like this:

#Load package and set model data and weights
library(lme4)
DATA    <- dat_weighted
WEIGHTS <- DATA$n_speed

#Fit linear mixed model with weights on the data
MODEL <- lmer(dv ~ factor(intervention)*factor(timepoint)*factor(region) + direction + (1|id),
              data = DATA, weights = WEIGHTS)

This type of command will fit your model in a way that weights each data point according to the specified weights (set proportionate to n_speed in this example). The resulting estimates for the relationship between the explanatory variables and the response variable will take account of that weighting. This would be a reasonable starting point if you want to use inverse-variance weighting under the assumption that your DV measure is formed as a mean from independent measurements. If these assumptions do not hold then you can make appropriate adjustments to form a more sophisticated weighting. Ultimately, your view of the appropriate weights will depend on how the DV measure is formed from the different speed measurements and how these measurements relate to one another.
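As a sanity check on this rationale, here is a small self-contained simulation (illustrative only; it uses base R's `lm` rather than `lmer` so no packages are needed, and all variable names here are made up rather than taken from your data). Each response is generated as the mean of $n_S$ independent measurements, so its error variance is $\sigma^2/n_S$, and supplying `weights` proportionate to $n_S$ gives each point its inverse-variance weight:

```r
# Illustrative simulation of the inverse-variance weighting rationale
# (hypothetical names; not the asker's actual data)
set.seed(1)

n_obs   <- 200
n_speed <- sample(3:10, n_obs, replace = TRUE)  # measurements behind each observation
x       <- runif(n_obs)
sigma   <- 0.5                                  # sd of a single measurement

# Each observed y is an average of n_speed[i] measurements around the
# true line 1 + 2*x, so Var(y[i]) = sigma^2 / n_speed[i]
y <- 1 + 2 * x + rnorm(n_obs, sd = sigma / sqrt(n_speed))

# Weights proportionate to n_speed are the inverse-variance weights
fit_weighted   <- lm(y ~ x, weights = n_speed)
fit_unweighted <- lm(y ~ x)                     # for comparison

coef(fit_weighted)   # intercept and slope should be close to 1 and 2
```

The `weights` argument works the same way in `lmer`; this sketch just isolates the weighting logic from the mixed-model machinery.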

I hope that this answer is enough to start you off on further research for your problem. I would recommend that you take some time to look at weighting methods in regression problems and the theory behind this. You should also take some time to look at relevant literature in your own field showing how weighting based on multiple measurements is usually done in this context.

Ben
  • Thanks for the input Ben. Clarifications: 1) Yes, I do want more weight for observations which have a higher n_speed 2) How DV measurement is derived: We have speed measurements associated with a coordinate value in x,y,z space. Velocity estimation takes any three points on a surface and leverages rules of trigonometry so that they can be used in association with their speed measures to estimate the average velocity within the enclosed triangle. 3) Your assumptions of error distribution and variance are fair. 4) My field has not incorporated weighted measures in this context - this is nascent – myfatson Sep 14 '22 at 14:23
  • Okay, based on that I think it would be worth looking at the effect of multiple measurements on the level of variation in error using the trig rules at issue. That would be the next step in the field to determine the appropriate weighting as a function of n_speed. For now the method in the answer is a reasonable placeholder. – Ben Sep 14 '22 at 21:20