
I am trying to analyze historical golf results. Golf is a game where the score you get depends not only on your skill but also on environmental factors such as the weather and how easy or hard the course itself is. I am trying to isolate the skill factor from the environmental factor. I have a question about a hypothetical game that emulates my issue.

Imagine that there are 100 players. Each player has a skill signified by their own μ and σ. Each player's μ and σ are different and independent of each other. For the next 100 days, they play a game. This game also has its own μ and σ. None of these values are known to us.

Every day the game is played, we draw a number X from N(the game's μ, the game's σ). This simulates the environmental factor. Then, each player draws their own number Y from N(the player's own μ, the player's own σ). This simulates the skill factor. The player's score for that day is the game's X on that day plus the player's Y on that day. We know the player's score, but not the game's X or the player's Y.

We can assume that each player's μ and σ and the game's μ and σ are constant. My question: is it possible to work out each day's X (and thus the Y as well)? I do not necessarily need each day's absolute X; even a relative X (e.g. Day 50's X − Day 51's X = 1) would be equally valuable.
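To make the setup concrete, here is a minimal simulation sketch of the game I describe (Python; all the specific parameter values are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n_players, n_days = 100, 100

# Each player's skill distribution (placeholder values, unknown to the analyst).
player_mu = rng.normal(0.0, 2.0, size=n_players)
player_sigma = rng.uniform(0.5, 2.0, size=n_players)

# The game's (environmental) distribution, also unknown to the analyst.
game_mu, game_sigma = 70.0, 3.0

# Day-level environmental draw X_t, shared by every player on day t.
X = rng.normal(game_mu, game_sigma, size=n_days)

# Player-level skill draws Y_it and the observed scores S_it = X_t + Y_it.
Y = rng.normal(player_mu[:, None], player_sigma[:, None], size=(n_players, n_days))
S = X[None, :] + Y  # only S is observed; can X (even relative X) be recovered?
```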

Realdeo
  • The sum of two independent normals is normal: https://stats.stackexchange.com/questions/17800/what-is-the-distribution-of-the-sum-of-independent-normal-variables – kjetil b halvorsen Mar 26 '22 at 12:49

1 Answer


I consider a general case with $N$ players, $T$ courses, and $R$ repetitions of the game, and define two random variables:

  • the $i$-th player's competence, drawn from $p_i \sim N(\mu_{p_i},\sigma_{p_i}^2)$

  • the $t$-th course difficulty (environmental factor), $g_t \sim N(\mu_{g_t},\sigma_{g_t}^2)$

As you describe it, the score $S_{it}$ is the sum of the player's competence and the environmental factor: $$ S_{it} = p_i + g_t $$ so the score is also normally distributed (the sum of two independent Gaussians is Gaussian), $S_{it} \sim N(\mu_{it}, \sigma_{it}^2)$ with $\mu_{it} = \mu_{p_i}+\mu_{g_t}$ and $\sigma_{it}^2 = \sigma_{p_i}^2 + \sigma_{g_t}^2$.
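For a quick numerical check, suppose (hypothetically) $\mu_{p_i} = 2$, $\sigma_{p_i}^2 = 1$, $\mu_{g_t} = 70$ and $\sigma_{g_t}^2 = 9$; then $S_{it} \sim N(72, 10)$.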

Estimating the means/variances themselves is not possible (the player and course means enter only through their sum $\mu_{p_i}+\mu_{g_t}$, so they are not separately identified). However, to establish who wins it is enough to look at relative scores! Therefore, we define two relative variables from the data $S_{it}$:

  • The relative performance of players for game $t$: $$ P_{ij}^{(t)} = S_{it} - S_{jt} $$
  • and relative performance of player $i$ between games $t$ and $s$: $$ G_{ts}^{(i)} = S_{it} - S_{is} $$

These variables are again Gaussian: $$ P_{ij}^{(t)} \sim N\left(\Delta \mu_{ij}^{P}, \left(\Delta \sigma_{ij}^{P,(t)}\right)^2 \right)\\ G_{ts}^{(i)} \sim N\left(\Delta \mu_{ts}^{G}, \left(\Delta \sigma_{ts}^{G,(i)}\right)^2 \right) $$ where the means are subtracted, $\Delta \mu_{ij}^{P} = \mu_{p_i} - \mu_{p_j}$ and $\Delta \mu_{ts}^{G} = \mu_{g_t} - \mu_{g_s}$, while the variances add, $\left(\Delta \sigma_{ij}^{P,(t)}\right)^2 = \sigma_{it}^2 + \sigma_{jt}^2$ and $\left(\Delta \sigma_{ts}^{G,(i)}\right)^2 = \sigma_{it}^2 + \sigma_{is}^2$.
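As a minimal sketch, assuming the observed scores are stored in a players × games array `S` (a hypothetical layout, not something fixed by the question), the two relative variables are just row and column differences:

```python
def relative_performance(S, i, j):
    """P_ij^(t) = S_it - S_jt for every game t (player i vs. player j); S is a NumPy array."""
    return S[i, :] - S[j, :]

def relative_difficulty(S, t, s):
    """G_ts^(i) = S_it - S_is for every player i (game t vs. game s); S is a NumPy array."""
    return S[:, t] - S[:, s]
```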

The crucial point is that these variables let us compute two types of relevant probabilities:

  • that player $i$ will get a better score than player $j$ in game $t$: $$ Prob(P_{ij}^{(t)}>0) = \frac{1}{2} \left ( 1 - \text{erf}\left ( - \frac{\Delta\mu_{ij}^{P}}{\sqrt{2}\, \Delta \sigma_{ij}^{P,(t)}} \right ) \right ) $$
  • and that player $i$'s score in game $t$ will be higher than in game $s$: $$ Prob(G_{ts}^{(i)}>0) = \frac{1}{2} \left ( 1 - \text{erf}\left ( - \frac{\Delta\mu_{ts}^{G}}{\sqrt{2}\, \Delta \sigma_{ts}^{G,(i)}} \right ) \right ) $$ where both are readily evaluated via the Gaussian CDF (a small code sketch follows below).
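As a sketch, both probabilities can be evaluated with the standard Gaussian CDF, e.g. via `scipy.stats.norm` (the function name below is just for illustration); the inputs are the difference mean and its standard deviation:

```python
from scipy.stats import norm

def prob_positive(delta_mu, delta_sigma):
    """P(D > 0) for D ~ N(delta_mu, delta_sigma^2), i.e. 1 - Phi(-delta_mu / delta_sigma)."""
    return norm.sf(-delta_mu / delta_sigma)  # equivalently norm.cdf(delta_mu / delta_sigma)
```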

Now, treating the comparisons $P_{ij}^{(t)}$ for different $j$ as independent, one can answer many questions, such as the probability that player $i$ beats all the remaining competitors in the $t$-th game: $$ P_{best}(i,t) = \prod_{j (\neq i)} Prob(P_{ij}^{(t)}>0) $$
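A sketch of this product, where `delta_mu[j]` and `delta_sigma[j]` are hypothetical per-opponent estimates of the mean and standard deviation of $P_{ij}^{(t)}$ for a fixed game $t$:

```python
from scipy.stats import norm

def prob_best(i, delta_mu, delta_sigma):
    """Probability that player i beats every other player in a fixed game t,
    treating the pairwise comparisons P_ij^(t) as independent."""
    p = 1.0
    for j in range(len(delta_mu)):
        if j != i:
            p *= norm.sf(-delta_mu[j] / delta_sigma[j])  # Prob(P_ij^(t) > 0)
    return p
```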

Finally, to make use of these formulas, we must estimate the mean $\Delta\mu_{ij}^{P}$ and the variance $\left(\Delta \sigma_{ij}^{P,(t)}\right)^2$ (the same follows for the quantities related to $G$). These are readily estimated by a standard statistical argument using $R$ repetitions of the game-player pair: $$ \left \langle P_{ij}^{(t)} \right \rangle_{repetitions} = \Delta\mu_{ij}^{P} $$ so that the estimator reads $$ \hat{\Delta}\mu_{ij}^{P} = \frac{1}{R} \sum_{r=1}^R \left ( S_{it,r} - S_{jt,r} \right ) $$ where $S_{it,r}$ denotes the $r$-th repetition on course $t$ played by the $i$-th golfer. Similarly, for the variance: $$ \left \langle \left ( P_{ij}^{(t)} - \left \langle P_{ij}^{(t)} \right \rangle_{reps} \right )^2 \right \rangle = \left \langle \left ( P_{ij}^{(t)} \right )^2 \right \rangle - \left \langle P_{ij}^{(t)} \right \rangle^2 = \left ( \Delta \sigma_{ij}^{P,(t)} \right )^2 $$ so that $$ \left ( \hat{\Delta} \sigma_{ij}^{P,(t)} \right )^2 = \frac{1}{R} \sum_{r=1}^R \left ( S_{it,r} - S_{jt,r} \right )^2 - \left ( \hat{\Delta}\mu_{ij}^{P} \right )^2 $$
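A minimal sketch of these estimators, assuming the scores are collected in an array `S` of shape (players, courses, repetitions) (a hypothetical layout):

```python
import numpy as np

def estimate_pairwise(S, i, j, t):
    """Estimate Delta mu and Delta sigma (standard deviation) of P_ij^(t) from R repetitions."""
    diff = S[i, t, :] - S[j, t, :]                 # S_{it,r} - S_{jt,r}, r = 1..R
    delta_mu_hat = diff.mean()                     # (1/R) * sum_r diff_r
    delta_var_hat = (diff ** 2).mean() - delta_mu_hat ** 2
    return delta_mu_hat, np.sqrt(delta_var_hat)
```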

grelade