
I want the weighted mean of my dv, velocity.

In this scenario, velocity is a derived/interpolated measure composed of repeated measures of randomly sampled speeds in a given region. There will always be a fixed number of velocity measures derived from a smaller number of speed measurements (in the real data, though not in this sample data).

In the sample data, there are up to 10 dv measurements for each unique combination of id, timepoint, and direction. The <=10 dv measures can be derived from at least 3 speed measures.

As such, dv's that were calculated where there was more speed sampling are more accurate. This method is standard and has been validated in my field.

Sample data:

dat <- structure(list(n_speed = c(7, 6, 5, 4, 7, 6, 4, 9), id = c("subj_1", 
"subj_1", "subj_1", "subj_1", "subj_2", "subj_2", "subj_2", "subj_2"
), timepoint = c("t1", "t1", "t2", "t2", "t1", "t1", "t2", "t2"
), direction = c("long", "lat", "long", "lat", "long", "lat", 
"long", "lat")), class = "data.frame", row.names = c(NA, -8L))

> head(dat)
         dv     id intervention timepoint direction  region
1 0.7708878 subj_1         ctrl        t1      long healthy
2 0.9193373 subj_1         ctrl        t1      long healthy
3 1.0000385 subj_1         ctrl        t1      long healthy
4 0.6570246 subj_1         ctrl        t1      long healthy
5 0.9345068 subj_1         ctrl        t1      long healthy
6 0.9421999 subj_1         ctrl        t1      long healthy

and

speed_measures <- structure(list(n_speed = c(7, 6, 5, 4, 7, 6, 4, 9), id = c("subj_1", 
"subj_1", "subj_1", "subj_1", "subj_2", "subj_2", "subj_2", "subj_2"
), timepoint = c("t1", "t1", "t2", "t2", "t1", "t1", "t2", "t2"
), direction = c("long", "lat", "long", "lat", "long", "lat", 
"long", "lat")), class = "data.frame", row.names = c(NA, -8L))

> speed_measures
  n_speed     id timepoint direction
1       7 subj_1        t1      long
2       6 subj_1        t1       lat
3       5 subj_1        t2      long
4       4 subj_1        t2       lat
5       7 subj_2        t1      long
6       6 subj_2        t1       lat
7       4 subj_2        t2      long
8       9 subj_2        t2       lat

Here we can see that for subj_1 x t2 x lat, we derived all the dv values at the unique level of id, timepoint, and direction using 4 speed measures. Conversely, for subj_2 x t2 x lat, we derived dv using 9 speed measures. When we ultimately calculate the estimated marginal means of t2 x lat, subj_2 should have greater influence on the mean than subj_1.

So we can join these tables to see the number of speed measurements that went into deriving the dv measurements for each unique id x timepoint x direction combination. region and intervention are additional factors, but each speed measurement and derived velocity occurred at the unique id x timepoint x direction levels.

We will only take 70 of the 80 measures to simulate imbalance in the actual data.

dat_combined <- speed_measures |>
  left_join(dat, by = c("id", "timepoint", "direction")) |>
  slice_sample(n = 70)

> head(dat_combined)
  n_speed     id timepoint direction        dv intervention  region
1       7 subj_1        t1      long 0.7708878         ctrl healthy
2       7 subj_1        t1      long 0.9193373         ctrl healthy
3       7 subj_1        t1      long 1.0000385         ctrl healthy
4       7 subj_1        t1      long 0.6570246         ctrl healthy
5       7 subj_1        t1      long 0.9345068         ctrl healthy
6       7 subj_1        t1      long 0.9421999         ctrl healthy

The ultimate goal is to determine velocity changes at the 3 way interaction of timepoint, intervention, and region, controlling for/averaging over direction with emmeans.

I built the following mixed effects model using R lme4:

dv ~ intervention * timepoint * region + direction + (1|id)

This was my initial attempt to derive a semblance of weights:

dat_weighted <- dat_combined |> 
  group_by(id, timepoint, direction) |>
  mutate(length_dv = length(dv)) |>
  mutate(num_speed_per_velocity = n_speed/length_dv)

> head(dat_weighted)
# A tibble: 6 × 9
# Groups:   id, timepoint, direction [5]
  n_speed id     timepoint direction    dv intervention region  length_dv num_speed_per_velocity
    <dbl> <chr>  <chr>     <chr>     <dbl> <chr>        <chr>       <int>                  <dbl>
1       4 subj_2 t2        long      0.627 trt          damaged         9                  0.444
2       6 subj_1 t1        lat       0.508 ctrl         damaged         9                  0.667
3       6 subj_1 t1        lat       0.703 ctrl         healthy         9                  0.667
4       5 subj_1 t2        long      0.748 ctrl         healthy         8                  0.625
5       6 subj_2 t1        lat       0.326 trt          damaged         9                  0.667
6       9 subj_2 t2        lat       0.589 trt          damaged         8                  1.12

Looking forward to your input!

myfatson

1 Answer


The lmer function in the lme4 package allows you to specify an argument weights that gives weights used for fitting the model. From your description, it sounds like you want to give higher weight to observations which have a higher value for the n_speed variable, though you have not specified exactly how this variable should affect the weights for the observations, or how your DV measurement is derived from the different measurements taken.

Depending on your particular theory and approach, one reasonable way to set the weights in a regression analysis is to use inverse-variance weighting based on your view of the likely effect of the number of speed measurements on the error variance. This is complicated, since the effect of the number of measurements may be mediated through the relationship of the explanatory variables to the response. Nevertheless, some crude methods can be used to start you off.

Suppose that we are willing to treat the measurements as if they were independent measurements having errors with zero mean and fixed variance. Since you have not specified to the contrary, suppose furthermore that we treat your DV as an average of the different speed measurements. In this case the variance of the measurement error for the DV would be proportionate to $1/n_S$ where $n_S$ is the number of speed measurements in the analysis. Consequently, using inverse-variance weighting would mean that we would set the weights proportionate to $n_S$. To do this, you would fit your model with a command like this:

#Load package and set model data and weights
library(lme4)
DATA    <- dat_weighted
WEIGHTS <- DATA$n_speed

#Fit linear mixed model with weights on the data
MODEL <- lmer(dv ~ factor(intervention)*factor(timepoint)*factor(region) + direction + (1|id),
              data = DATA, weights = WEIGHTS)

This type of command will fit your model in a way that weights each data point according to the specified weights (set proportionate to n_speed in this example). The resulting estimates for the relationship between the explanatory variables and the response variable will take account of that weighting. This would be a reasonable starting point if you want to use inverse-variance weighting under the assumption that your DV measure is formed as a mean from independent measurements. If these assumptions do not hold then you can make appropriate adjustments to form a more sophisticated weighting. Ultimately, your view of the appropriate weights will depend on how the DV measure is formed from the different speed measurements and how these measurements relate to one another.
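As a sanity check on this rationale, here is a small self-contained simulation (illustrative only; it uses base R's `lm` rather than `lmer` so no packages are needed, and all variable names here are made up rather than taken from your data). Each response is generated as the mean of $n_S$ independent measurements, so its error variance is $\sigma^2/n_S$, and supplying `weights` proportionate to $n_S$ gives each point its inverse-variance weight:

```r
# Illustrative simulation of the inverse-variance weighting rationale
# (hypothetical names; not the asker's actual data)
set.seed(1)

n_obs   <- 200
n_speed <- sample(3:10, n_obs, replace = TRUE)  # measurements behind each observation
x       <- runif(n_obs)
sigma   <- 0.5                                  # sd of a single measurement

# Each observed y is an average of n_speed[i] measurements around the
# true line 1 + 2*x, so Var(y[i]) = sigma^2 / n_speed[i]
y <- 1 + 2 * x + rnorm(n_obs, sd = sigma / sqrt(n_speed))

# Weights proportionate to n_speed are the inverse-variance weights
fit_weighted   <- lm(y ~ x, weights = n_speed)
fit_unweighted <- lm(y ~ x)                     # for comparison

coef(fit_weighted)   # intercept and slope should be close to 1 and 2
```

The `weights` argument works the same way in `lmer`; this sketch just isolates the weighting logic from the mixed-model machinery.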

I hope that this answer is enough to start you off on further research for your problem. I would recommend that you take some time to look at weighting methods in regression problems and the theory behind this. You should also take some time to look at relevant literature in your own field showing how weighting based on multiple measurements is usually done in this context.

Ben
  • Thanks for the input Ben. Clarifications: 1) Yes, I do want more weight for observations which have a higher n_speed 2) How DV measurement is derived: We have speed measurements associated with a coordinate value in x,y,z space. Velocity estimation takes any three points on a surface and leverages rules of trigonometry so that they can be used in association with their speed measures to estimate the average velocity within the enclosed triangle. 3) Your assumptions of error distribution and variance are fair. 4) My field has not incorporated weighted measures in this context - this is nascent – myfatson Sep 14 '22 at 14:23
  • Okay, based on that I think it would be worth looking at the effect of multiple measurements on the level of variation in error using the trig rules at issue. That would be the next step in the field to determine the appropriate weighting as a function of n_speed. For now the method in the answer is a reasonable placeholder. – Ben Sep 14 '22 at 21:20