
I would like to infer the contribution of each rower in a crew boat from a number of races: 8 rowers are split repeatedly into two boats of 4 rowers each. Each race over a distance yields an estimate of the power the crew delivered. For example, the 4 rowers Amanda, Cam, Emily and Cait raced together and delivered 711 watts in the first race; the other 4 rowers delivered 668 watts in the second race. My goal is to infer each rower's contribution, which is assumed to be constant over the races:

   Amanda Cam Emily Cait Paula Janeska Charli Diana    power
1       1   1     1    1     0       0      0     0 711.0960
2       0   0     0    0     1       1      1     1 667.5720
3       0   1     0    1     1       0      1     0 540.5055
4       1   0     1    0     0       1      0     1 783.7682
5       0   1     1    0     0       1      1     0 657.2489
6       1   0     0    1     1       0      0     1 667.5720
7       1   1     0    0     1       1      0     0 627.5287
8       0   0     1    1     0       0      1     1 590.6250
9       1   1     0    0     0       0      1     1 647.1376
10      0   0     1    1     1       1      0     0 599.5737
11      0   1     1    0     1       0      0     1 734.2822
12      1   0     0    1     0       1      1     0 608.7041

fit <- lm(power ~ 0 + Amanda + Cam + Emily + Cait +
          Paula + Janeska + Charli + Diana)

The basic idea is that the total power in each race is the sum of the individual power contributions and there is no other power source. Multiple linear regression estimates a coefficient for each rower's contribution to the total, and the intercept is fixed at zero because there is no other power source.

> summary(fit)

Call:
lm(formula = power ~ 0 + Amanda + Cam + Emily + Cait + Paula +
    Janeska + Charli + Diana)

Residuals:
      1       2       3       4       5       6       7       8       9      10      11      12
 36.366  36.366   9.169   9.169   9.443   9.443 -43.891 -43.891 -29.612 -29.612  18.525  18.525

Coefficients: (1 not defined because of singularities)
        Estimate Std. Error t value Pr(>|t|)
Amanda    401.77      27.11  14.819 2.53e-05 ***
Cam       -43.29      30.47  -1.421   0.2146
Emily     409.47      27.11  15.103 2.31e-05 ***
Cait      -93.22      30.47  -3.059   0.0281 *
Paula     349.58      27.11  12.894 5.00e-05 ***
Janeska   -36.64      30.47  -1.202   0.2830
Charli    318.27      27.11  11.739 7.89e-05 ***
Diana         NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 43.09 on 5 degrees of freedom
Multiple R-squared:  0.9982,  Adjusted R-squared:  0.9957
F-statistic: 396.7 on 7 and 5 DF,  p-value: 1.483e-06
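The note "1 not defined because of singularities" can be checked directly on the design matrix. The following is a minimal sketch, assuming the crew assignments from the table above and the side assignment listed under Additional Constraints; the names `X` and `v` are introduced here for illustration only.

```r
# Crew assignments for the 12 races (1 = rower was in the boat).
X <- cbind(
  Amanda  = c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1),
  Cam     = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0),
  Emily   = c(1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0),
  Cait    = c(1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1),
  Paula   = c(0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0),
  Janeska = c(0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1),
  Charli  = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1),
  Diana   = c(0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0)
)

# Rank of the design matrix. lm() reported one undefined coefficient,
# which corresponds to rank 7 rather than 8.
qr(X)$rank

# Stroke-side rowers get +1, bow-side rowers get -1.
v <- c(Amanda = 1, Cam = -1, Emily = 1, Cait = -1,
       Paula = 1, Janeska = -1, Charli = 1, Diana = -1)

# Every crew has two rowers per side, so X %*% v is zero in every race.
drop(X %*% v)
```

Because adding any multiple of `v` to the coefficients leaves all fitted crew powers unchanged, the data alone cannot fix the absolute level of one side against the other.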

It almost works but here is where I could use help:

  • The races do not determine all parameters. Viewed as a linear algebra problem, the 12 equations do not pin down all 8 parameters here: lm reports one coefficient as not defined because of singularities. Because races are tiring, we can't simply add races.

  • Is it possible to better describe what we learn about each rower as races progress?

  • Rather than asking for absolute power contributions, can we infer relative contributions? For example, rowers who participate in high-powered races are likely to be strong contributors. How can this be captured better?

  • This is a new take on a similar question I had asked before. I am still looking for the right statistical framework.

R code:

Amanda  = c(1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L) 
Cam     = c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L) 
Emily   = c(1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L) 
Cait    = c(1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L) 
Paula   = c(0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L) 
Janeska = c(0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L) 
Charli  = c(0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L) 
Diana   = c(0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L)

power = c(711.096048081832, 667.572021484375, 540.50554255546, 783.768218256806, 657.248927084595, 667.572021484375, 627.528708276313, 590.625, 647.137583778637, 599.573712607379, 734.282165754758, 608.704121100815)

df <- data.frame(Amanda,Cam,Emily,Cait,Paula,Janeska,Charli,Diana,power)

fit <- lm(power ~ 0 + Amanda + Cam + Emily + Cait +
          Paula + Janeska + Charli + Diana)

Additional Constraints

This section was added late and covers some additional constraints that could be modelled.

  • Each rower has only one oar and thus a crew consists of two rowers rowing on so-called bow side and two on so-called stroke side. Rowers don't switch sides - a rower always rows on the same side.

  • Stroke side: Emily, Amanda, Paula, Charli; bow side: Diana, Janeska, Cam, Cait

  • Pairs of races happen in short succession or side by side: 1/2, 3/4 and so on. This implies all rowers are split between the two boats racing. If we assume that a rower's power output is constant, that would imply that the total power of such a pair is constant as well. As can be seen, this is only approximately the case and is not modelled. A typical reason is that rowers get tired and can't emit as much power in their 3rd race as in the 1st.

  • Because rowers are rowing with one oar, the power difference in a crew between stroke side and bow side can't be too large as the boat would not go straight otherwise. This is currently not modelled.
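The claim that the total power of a race pair is only approximately constant can be checked with a few lines. This is a minimal sketch, assuming the `power` values from the question; `pair` and `pair_total` are names introduced here for illustration.

```r
power <- c(711.096048081832, 667.572021484375, 540.50554255546,
           783.768218256806, 657.248927084595, 667.572021484375,
           627.528708276313, 590.625, 647.137583778637,
           599.573712607379, 734.282165754758, 608.704121100815)

# Races 1/2, 3/4, ... form pairs that together involve all 8 rowers.
pair <- rep(1:6, each = 2)

# Total power of both boats in each pair. If every rower's output were
# constant, these six numbers would be equal.
pair_total <- tapply(power, pair, sum)
round(pair_total, 1)

# Spread between the strongest and the weakest pair.
diff(range(pair_total))
```

The totals differ by well over 100 watts between pairs, which is the variation the constant-output assumption leaves unexplained.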

Traditional Method

The traditional method of ranking rowers is based on the time races take: each rower accumulates the time they spent racing and is later ranked on the accumulated time. This is equivalent to summing up, for each rower, the crew power of the boats they raced in and then ranking on that total. My goal is to improve on this, as this method gives no insight into the uncertainties.
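The power-based version of this ranking can be sketched as follows. It assumes the race data from the question and follows the description above literally; `X` and `accumulated` are names introduced here for illustration.

```r
# Crew assignments for the 12 races (1 = rower was in the boat).
X <- cbind(
  Amanda  = c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1),
  Cam     = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0),
  Emily   = c(1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0),
  Cait    = c(1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1),
  Paula   = c(0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0),
  Janeska = c(0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1),
  Charli  = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1),
  Diana   = c(0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0)
)
power <- c(711.096048081832, 667.572021484375, 540.50554255546,
           783.768218256806, 657.248927084595, 667.572021484375,
           627.528708276313, 590.625, 647.137583778637,
           599.573712607379, 734.282165754758, 608.704121100815)

# For each rower, sum the crew power of the boats they raced in.
# power is recycled down each column, so row i is scaled by power[i].
accumulated <- colSums(X * power)

# Rank rowers by accumulated crew power, highest first.
sort(accumulated, decreasing = TRUE)
```

The result is a single number per rower with no standard error attached, which is the limitation noted above.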

  • It sounds like what you really needed was a better experimental design: the messages are telling you it's not mathematically possible to estimate coefficients from these 12 sets of teams, no matter what the output might have been. (The sum of the even-numbered columns equals the sum of the odd-numbered ones.) That's why the estimates are so whacky. BTW, it's usually a good idea to include an intercept until you have obtained results indicating it is inconsequential. At the very least, that would estimate any systematic bias in the power measurement system. – whuber Sep 08 '22 at 21:48
  • @whuber - that odd-even feature may be a consequence of how rowers actually row: the odd numbers may be rowers who row with their oars to starboard and the even number who row with their oars to the port side, with individual rowers having a personal preference - if you did not have balance then the boat would go round in circles. You still should be able to estimate the differences among each group. You can even estimate the best four consistent with this balance (I think Amanda, Emily, Janeska, Diana with an estimated power of $774.6$ compared to their actual $783.8$ in race $4$) – Henry Sep 08 '22 at 23:29
  • Indeed, this is from a class of boats where each rower only rows on one side. I believe what is missing is encoding that the two sides of the boat contribute about equal power, as it otherwise would not go straight. Regarding the intercept: if it is not fixed to zero, the best fit is a model with a high constant power where rowers contribute only small variations of additional power. – Christian Lindig Sep 09 '22 at 06:35
  • @Christian: right--and those "small variations" are the information being sought. – whuber Sep 09 '22 at 13:18
  • @Henry It's still possible, with a proper design, to estimate the relative power of all rowers. Imagine, for instance, that you kept one group of four starboard members while swapping out the port members. That would estimate the relative contributions of all port rowers. Do the same for the starboard rowers to estimate their relative contributions. – whuber Sep 09 '22 at 13:21
  • @whuber I am not free to re-design the race - this is existing data. But an optimal design is an interesting question. We can see that all rowers are involved in races 1/2, 3/4, 5/6 and so on. This is a consequence of the two boats racing against each other. The constant-power assumption would suggest that the total power of each pair is constant, but in fact it is not. I don't know yet if I should model this or how. – Christian Lindig Sep 09 '22 at 13:31
  • You were clear that you could not redesign the race. My point is that there are limitations inherent in your data that are impossible to overcome. You have to live with them. They are tantamount to having one unidentifiable parameter, which you will need to stipulate if you wish to obtain any meaningful estimates. – whuber Sep 09 '22 at 13:34
  • @whuber: In effect the calculations already performed may do this: the coefficients for Cam, Cait and Janeska seem to be relative to Diana's power contribution, while if you subtracted say $409.47$ from the other four, you could have their power contributions relative to Emily – Henry Sep 09 '22 at 15:46
  • @ChristianLindig: You are giving a lot of extra information in comments. Please add that as edits to the post, so it is complete. For the next time, you really need to design the experiment first, using principles for optimal experiment design. Start by looking into fractional factorial experiments, and search this site for D-optimal design! And maybe you should include in the model info on left/right oars! – kjetil b halvorsen Sep 09 '22 at 19:06
  • I added more constraints that could be modelled to the original question. – Christian Lindig Sep 09 '22 at 19:44
  • @Henry As indeed they must. By dropping one of the variables, lm has arbitrarily selected a method to identify the rest. – whuber Sep 09 '22 at 21:26
  • Maybe I’m phrasing this wrong. After these races, what have we learned in terms of performance? Is there a better way to approach this except for a different design or more races? – Christian Lindig Sep 10 '22 at 08:34
  • If you make a mild implicit assumption, you might obtain useful estimates: you have repeatedly asserted the total power output of the two sides must be the same. Although this comes about because one side must be holding back, if only slightly, it does lead to an estimable model. Replace each observation by two observations: one for each side, to which half of the power is attributed. The model must accommodate non-independent responses, because each of these two halves is perfectly correlated. – whuber Sep 10 '22 at 15:14
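The point made in the comments, that lm picks one rower to drop while differences within a side remain estimable, can be illustrated with a short sketch. It assumes the race data from the question; `races`, `fit_a`, `fit_b` and the helper `coef0` are hypothetical names introduced here.

```r
races <- data.frame(
  Amanda  = c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1),
  Cam     = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0),
  Emily   = c(1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0),
  Cait    = c(1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1),
  Paula   = c(0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0),
  Janeska = c(0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1),
  Charli  = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1),
  Diana   = c(0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
  power   = c(711.096048081832, 667.572021484375, 540.50554255546,
              783.768218256806, 657.248927084595, 667.572021484375,
              627.528708276313, 590.625, 647.137583778637,
              599.573712607379, 734.282165754758, 608.704121100815)
)

# Same model twice; only the column order differs, so lm() drops a
# different aliased column each time (the last one in the formula).
fit_a <- lm(power ~ 0 + Amanda + Cam + Emily + Cait +
              Paula + Janeska + Charli + Diana, data = races)
fit_b <- lm(power ~ 0 + Diana + Cam + Emily + Cait +
              Paula + Janeska + Charli + Amanda, data = races)

# Treat the dropped (NA) coefficient as stipulated to be zero.
coef0 <- function(fit) {
  b <- coef(fit)
  b[is.na(b)] <- 0
  b
}
a <- coef0(fit_a)
b <- coef0(fit_b)

# Individual coefficients depend on which rower was dropped ...
c(a[["Emily"]], b[["Emily"]])

# ... but the difference between two rowers on the same side does not.
c(a[["Amanda"]] - a[["Emily"]], b[["Amanda"]] - b[["Emily"]])

# The fitted crew powers are identical as well.
all.equal(fitted(fit_a), fitted(fit_b))
```

Differences between a stroke-side and a bow-side rower do change with the dropped column, which is the one unidentifiable parameter that has to be stipulated.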

1 Answer

Based on the idea discussed in the comments on the question, here is a possible solution that takes an additional constraint into consideration: the total power of a crew is assumed to be split evenly between the rowers of the two sides. This is modelled by doubling the number of observations: each race is split into two rows, each containing only the rowers of one side and half of the crew power.

   Amanda Cam Emily Cait Paula Janeska Charli Diana time    power crew
1       1   0     1    0     0       0      0     0  188 355.5480    1
2       0   0     0    0     1       0      1     0  192 333.7860    2
3       0   0     0    0     1       0      1     0  206 270.2528    3
4       1   0     1    0     0       0      0     0  182 391.8841    4
5       0   0     1    0     0       0      1     0  193 328.6245    5
6       1   0     0    0     1       0      0     0  192 333.7860    6
7       1   0     0    0     1       0      0     0  196 313.7644    7
8       0   0     1    0     0       0      1     0  200 295.3125    8
9       1   0     0    0     0       0      1     0  194 323.5688    9
10      0   0     1    0     1       0      0     0  199 299.7869   10
11      0   0     1    0     1       0      0     0  186 367.1411   11
12      1   0     0    0     0       0      1     0  198 304.3521   12
13      0   1     0    1     0       0      0     0  188 355.5480    1
14      0   0     0    0     0       1      0     1  192 333.7860    2
15      0   1     0    1     0       0      0     0  206 270.2528    3
16      0   0     0    0     0       1      0     1  182 391.8841    4
17      0   1     0    0     0       1      0     0  193 328.6245    5
18      0   0     0    1     0       0      0     1  192 333.7860    6
19      0   1     0    0     0       1      0     0  196 313.7644    7
20      0   0     0    1     0       0      0     1  200 295.3125    8
21      0   1     0    0     0       0      0     1  194 323.5688    9
22      0   0     0    1     0       1      0     0  199 299.7869   10
23      0   1     0    0     0       0      0     1  186 367.1411   11
24      0   0     0    1     0       1      0     0  198 304.3521   12

This leads to an estimate for the power of each rower:

   Diana    Emily   Amanda  Janeska      Cam    Paula     Cait   Charli 
184.8857 183.0903 179.2419 166.5655 163.2410 153.1454 138.2756 137.4902 
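The construction described above can be sketched in R as follows. This is a minimal sketch, assuming the race data and side assignment from the question; `X_stroke`, `X_bow`, `halves` and `fit_half` are names introduced here for illustration.

```r
# Crew assignments for the 12 races (1 = rower was in the boat).
X <- cbind(
  Amanda  = c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1),
  Cam     = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0),
  Emily   = c(1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0),
  Cait    = c(1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1),
  Paula   = c(0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0),
  Janeska = c(0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1),
  Charli  = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1),
  Diana   = c(0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0)
)
power <- c(711.096048081832, 667.572021484375, 540.50554255546,
           783.768218256806, 657.248927084595, 667.572021484375,
           627.528708276313, 590.625, 647.137583778637,
           599.573712607379, 734.282165754758, 608.704121100815)

stroke <- c("Emily", "Amanda", "Paula", "Charli")
bow    <- c("Diana", "Janeska", "Cam", "Cait")

# One copy of the races with only the stroke-side rowers,
# one copy with only the bow-side rowers.
X_stroke <- X
X_stroke[, bow] <- 0
X_bow <- X
X_bow[, stroke] <- 0

# Each half-crew is credited with half of the crew power.
halves <- data.frame(rbind(X_stroke, X_bow),
                     power = rep(power / 2, 2),
                     crew  = rep(1:12, 2))

fit_half <- lm(power ~ 0 + Amanda + Cam + Emily + Cait +
                 Paula + Janeska + Charli + Diana, data = halves)

# Estimated power per rower, strongest first.
sort(coef(fit_half), decreasing = TRUE)
```

As noted in the comments, the two halves of each race are perfectly correlated, while a plain lm fit treats all 24 rows as independent, so the reported standard errors should be read with caution.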

Below are the call and the output of the linear model.

Call:
lm(formula = power ~ 0 + Amanda + Cam + Emily + Cait + Paula + 
    Janeska + Charli + Diana)

Residuals:
    Min      1Q  Median      3Q     Max
-36.449 -19.063  -3.118  12.722  54.031

Coefficients:
        Estimate Std. Error t value Pr(>|t|)
Amanda    179.24      13.84  12.950 6.77e-10 ***
Cam       163.24      13.84  11.794 2.64e-09 ***
Emily     183.09      13.84  13.228 4.96e-10 ***
Cait      138.28      13.84   9.990 2.79e-08 ***
Paula     153.15      13.84  11.064 6.61e-09 ***
Janeska   166.57      13.84  12.034 1.98e-09 ***
Charli    137.49      13.84   9.933 3.01e-08 ***
Diana     184.89      13.84  13.357 4.29e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.32 on 16 degrees of freedom
Multiple R-squared:  0.9943,  Adjusted R-squared:  0.9915
F-statistic: 349.1 on 8 and 16 DF,  p-value: < 2.2e-16

R structure for experiments:

structure(list(Amanda = c(1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
    Cam = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 
    0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L), Emily = c(1L, 
    0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Cait = c(0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 
    1L, 0L, 1L, 0L, 1L), Paula = c(0L, 1L, 1L, 0L, 0L, 1L, 1L, 
    0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L), Janeska = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L
    ), Charli = c(0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 
    1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Diana = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 
    0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L), time = c(188, 192, 206, 
    182, 193, 192, 196, 200, 194, 199, 186, 198, 188, 192, 206, 
    182, 193, 192, 196, 200, 194, 199, 186, 198), power = c(355.548024040916, 
    333.786010742188, 270.25277127773, 391.884109128403, 328.624463542298, 
    333.786010742188, 313.764354138157, 295.3125, 323.568791889319, 
    299.78685630369, 367.141082877379, 304.352060550408, 355.548024040916, 
    333.786010742188, 270.25277127773, 391.884109128403, 328.624463542298, 
    333.786010742188, 313.764354138157, 295.3125, 323.568791889319, 
    299.78685630369, 367.141082877379, 304.352060550408), crew = c(1L, 
    2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 
    4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L)), class = "data.frame", row.names = c(NA, 
-24L))