
I have a dataset with two categorical predictors and a continuous response. Suppose the predictors have $a$ and $b$ levels, respectively, with group-specific parameters $\alpha_1, \dots, \alpha_a$ and $\beta_1, \dots, \beta_b$. I can implement a sum-to-zero constraint in a two-way ANOVA model such that

$$\sum_{i=1}^a \alpha_i = 0 \text{ and } \sum_{i=1}^b \beta_i = 0$$

but I want to impose a constraint that will cause the weighted average of the parameters to equal zero where the weights are numbers of observations. That is, if the first predictor has $m_1, \dots, m_a$ observations for its $a$ levels and the second predictor has $n_1, \dots, n_b$ observations for its $b$ levels, then I want

$$\sum_{i=1}^a m_i \alpha_i = 0 \text{ and } \sum_{i=1}^b n_i \beta_i = 0.$$

Is it possible to impose this constraint and minimize the sum of squared errors? If so, how would it be implemented?

EDIT: A commenter asked why I want this constraint. I have a dataset composed of observations from a game in which there are two distinct roles (an A role and a B role). For each event in the game, the role A player and the role B player play against each other and generate a result. The dataset has two predictors (roleA and roleB, which are categorical variables containing the names of the role A and role B players) and the response is points, a continuous variable. The code below generates an example dataset. Note that the players do not play an equal number of times.

    N <- 10000
    set.seed(1)

    # create random matchups of role A and role B players
    roleA <- c(rep('A1', N * .4), rep('A2', N * .3), rep('A3', N * .15),
               rep('A4', N * .1), rep('A5', N * .05))
    roleA <- as.factor(sample(roleA, N, replace = FALSE))
    roleB <- c(rep('B1', N * .4), rep('B2', N * .3), rep('B3', N * .2),
               rep('B4', N * .1))
    roleB <- as.factor(sample(roleB, N, replace = FALSE))

    # player names in the same order as the means below
    # (not defined in the original post; assumed here to be the factor levels)
    roleA_players <- levels(roleA)
    roleB_players <- levels(roleB)

    # these vectors contain average points created per event by each player
    # weighted average is zero for both roles
    A_means <- c(0.5, 0.1, 0, -1, -2.6)
    B_means <- c(0.6, 0.3, -0.5, -2.3)

    # points for each event is sum of points created by each player
    get_points <- function(p1, p2){
      r1 <- rnorm(1, mean = A_means[which(roleA_players == p1)], sd = 1)
      r2 <- rnorm(1, mean = B_means[which(roleB_players == p2)], sd = 1)
      r1 + r2
    }
    points <- mapply(get_points, p1 = roleA, p2 = roleB)

    head(data.frame(roleA, roleB, points))
      roleA roleB     points
    1    A1    B1 -0.7608573
    2    A1    B2 -1.4209561
    3    A2    B3 -1.4254282
    4    A4    B1 -0.2304648
    5    A1    B1  2.1109341
    6    A4    B1  0.8142395

I want to find the effect that each player has on points. For my dataset we can assume that the weighted average effects of the role A players and the role B players are both zero. However, because the players don't play an equal number of times, the parameters from a model with the conventional sum-to-zero (STZ) constraint won't have a weighted average of zero. Thus, I am trying to find a way to implement the constraint described above.
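The point about the conventional constraint can be checked with a short sketch. It is illustrative only: the data are re-simulated in vectorised form with the group sizes and means from the example (an assumption of this sketch, not the code above), and the names `d`, `fit_stz`, and `alpha_stz` are hypothetical.

```r
# Sketch only: simulated data using the example's group sizes and means.
set.seed(1)
A_means <- c(A1 = 0.5, A2 = 0.1, A3 = 0, A4 = -1, A5 = -2.6)
B_means <- c(B1 = 0.6, B2 = 0.3, B3 = -0.5, B4 = -2.3)
m <- c(4000, 3000, 1500, 1000, 500)   # observations per role A player
n <- c(4000, 3000, 2000, 1000)        # observations per role B player

d <- data.frame(
  roleA = factor(sample(rep(names(A_means), times = m))),
  roleB = factor(sample(rep(names(B_means), times = n)))
)
d$points <- rnorm(nrow(d),
                  mean = A_means[as.character(d$roleA)] +
                         B_means[as.character(d$roleB)],
                  sd = sqrt(2))       # sum of two independent sd = 1 draws

# Fit with the usual (unweighted) sum-to-zero contrasts.
fit_stz <- lm(points ~ roleA + roleB, data = d,
              contrasts = list(roleA = "contr.sum", roleB = "contr.sum"))

cf <- coef(fit_stz)
alpha_stz <- cf[grep("^roleA", names(cf))]
alpha_stz <- c(alpha_stz, -sum(alpha_stz))  # last level = minus the sum of the others
names(alpha_stz) <- levels(d$roleA)

sum(alpha_stz)               # zero up to rounding, by construction
weighted.mean(alpha_stz, m)  # generally not zero when group sizes differ
```

Under sum-to-zero coding each estimated effect is measured relative to the unweighted mean of the effects. The true role A means here have an unweighted mean of -0.6, so the estimates are shifted by roughly +0.6 and their count-weighted average lands near 0.6 rather than 0.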

Avraham
  • Why do you want such a constraint? That makes for parameters with a data-dependent definition, which are more difficult to interpret. Historically such constraints were used because they could simplify calculations, but that is prehistoric now. – kjetil b halvorsen Sep 09 '17 at 19:04
  • The short answer is that the constraint makes sense for my dataset. I have added a longer explanation to my question along with an example dataset. – mr_gasbag Sep 10 '17 at 07:11
  • If I run your code then roleA_players is not defined – Sextus Empiricus Dec 27 '21 at 21:16

2 Answers

2

You can do this with something similar to contrasts. You model the response with variables that are linear combinations of your original variables. The coefficients of the alternative model express the differences between your coefficients, and by expressing the model in terms of differences you can ensure that your constraint is fulfilled.

There may be a simple way to have an R function do this. The example below demonstrates the idea.

Let the model be

$$y = \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3$$

and you have the constraint

$$m_1 \alpha_1 + m_2 \alpha_2 + m_3 \alpha_3 = 0$$

then you can use the alternative model

$$y = \beta_1 y_1 + \beta_2 y_2 $$

with

$$ y_1 = x_1 - \frac{m_1}{m_3} x_3 \quad \text{and} \quad y_2 = x_2 - \frac{m_2}{m_3} x_3 $$

and use the following transformation between coefficients

$$\begin{array}{rcl} \alpha_1 &=& \beta_1 \\ \alpha_2 &=& \beta_2 \\ \alpha_3 &=& - \frac{m_1}{m_3} \beta_1 - \frac{m_2}{m_3} \beta_2 \end{array}$$

If you substitute these expressions into the alternative model $y = \beta_1 y_1 + \beta_2 y_2$, you recover the original model $y = \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3$.
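The reparameterization above can be sketched in R as a custom contrast matrix, one per factor. This is a sketch under stated assumptions, not a definitive implementation: the data are re-simulated with the group sizes and means from the question, an intercept is included (the three-term model above has none), and `weighted_contr` is a hypothetical helper, not a built-in function.

```r
# Sketch only: simulated data using the question's group sizes and means.
set.seed(1)
A_means <- c(A1 = 0.5, A2 = 0.1, A3 = 0, A4 = -1, A5 = -2.6)
B_means <- c(B1 = 0.6, B2 = 0.3, B3 = -0.5, B4 = -2.3)
d <- data.frame(
  roleA = factor(sample(rep(names(A_means), times = c(4000, 3000, 1500, 1000, 500)))),
  roleB = factor(sample(rep(names(B_means), times = c(4000, 3000, 2000, 1000))))
)
d$points <- rnorm(nrow(d),
                  mean = A_means[as.character(d$roleA)] +
                         B_means[as.character(d$roleB)],
                  sd = sqrt(2))

# Hypothetical helper: contrast matrix with an identity block for the first
# k - 1 levels and a last row of -m_i / m_k, so the implied effects satisfy
# sum(m * alpha) = 0 for any coefficient values.
weighted_contr <- function(f) {
  m <- as.vector(table(f))
  k <- length(m)
  C <- rbind(diag(k - 1), -m[-k] / m[k])
  rownames(C) <- levels(f)
  C
}

fit_default <- lm(points ~ roleA + roleB, data = d)  # treatment contrasts, for comparison

contrasts(d$roleA) <- weighted_contr(d$roleA)
contrasts(d$roleB) <- weighted_contr(d$roleB)
fit_w <- lm(points ~ roleA + roleB, data = d)

# Map the k - 1 fitted coefficients back to one effect per level.
cf <- coef(fit_w)
alpha <- drop(contrasts(d$roleA) %*% cf[grep("^roleA", names(cf))])
beta  <- drop(contrasts(d$roleB) %*% cf[grep("^roleB", names(cf))])

sum(as.vector(table(d$roleA)) * alpha)  # zero up to rounding
sum(as.vector(table(d$roleB)) * beta)   # zero up to rounding
all.equal(unname(fitted(fit_w)), unname(fitted(fit_default)))  # same least-squares fit
```

Because the weighted coding spans the same column space as the default coding, the fitted values and the residual sum of squares are unchanged; only the parameterization differs. With both weighted constraints in place, the intercept in this sketch equals the overall mean of `points`.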

0

I believe you can just take $\alpha'_i=m_i\alpha_i$ and $\beta'_i=n_i\beta_i$, estimate $\alpha'_i$ and $\beta'_i$, and then divide the estimates by $m_i$ and $n_i$ respectively, since you know $m_i$ and $n_i$.

tintinthong
  • Thank you, but I'm not following your answer. What does it mean to take $\alpha_i' = m_i \alpha_i$? Are you saying to estimate $\alpha_i$ using the usual STZ constraint and then multiply by $m_i$? Please clarify. – mr_gasbag Sep 04 '17 at 21:36
  • @mr_gasbag No, that is not what I am saying. I am saying estimate your regression, for example, as $Y_i = \alpha'_1 x_{1i} + \alpha'_2 x_{2i} + \beta'_1 x_{3i} + \beta'_2 x_{4i}$. And then get $\alpha_1 = \frac{\alpha'_1}{m_1}$, $\alpha_2 = \frac{\alpha'_2}{m_2}$, $\beta_1 = \frac{\beta'_1}{n_1}$, $\beta_2 = \frac{\beta'_2}{n_2}$. – tintinthong Sep 05 '17 at 05:53
  • I had accepted this answer since it met the request that the weighted average of the parameters should be zero. However, the resulting parameters are nonsense and would produce large errors. After all, the same goal could be achieved by making all the parameters equal to zero. What I need is for the weighted average to be zero and for the sum of squared errors to be minimized. I will clarify the question. – mr_gasbag Sep 09 '17 at 18:26