What is the resulting distribution if I merge two different distributions?

Question

The title is not the best but I really do not know how to describe the scenario in a better way.

The context

Consider taking measurements of two different quantities:

The time needed for a car to traverse the city of Koenigsberg
The time needed for a person to traverse the city of Koenigsberg

Now, this is how measurements are done for every car and person:

As they enter the town, I would start a timer.
The car and the person would choose different paths inside the city but will eventually get out.
That is when I would stop the timer and record the time.

I get to collect 2 sets of times:

Let us call $C$ the random variable capturing the times of cars.
Let us call $H$ the random variable capturing the times of people.

Both random variables would be distributed according to a certain distribution. As the collected values are plotted in a graph, showing the frequencies of ranges of times (bins), it will be possible to get a glimpse of the PDF of both $C$ and $H$: $f_C$ and $f_H$.

Question

Consider now this procedure:

I take all the measurements done for cars.
I take all the measurements done for people.
I merge those into one collection.
I plot the frequency histogram of that set.

By doing so I would get a third PDF capturing a third random variable which I will call $X$.

How does $X$ relate to $C$ and $H$?
How can I mathematically retrieve $f_X$ from $f_C$ and $f_H$? What is the relation connecting the 3 distributions from an analytical perspective?

Some further reflection

This looks as if $X$ is a combination of $C$ and $H$:

$$X = g(C, H)$$

But what is $g$?

If this were a scenario such as $X = C + H$, it would be simple: $X$ would be the sum of two random variables, and there is extensive literature on how to tackle that situation. But here $X$ is not the sum of $C$ and $H$; it is something else.

Have a look at mixture distributions – Ggjj11, Jul 05 '23 at 19:52

whuber · Accepted Answer · 2023-07-06T14:34:49.563

Let $\mathcal C$ be the event where the person is in a car and $\mathcal H$ be its complement (not in a car). This event is random and your narrative implicitly supposes its probability $p = \Pr(\mathcal C)$ is unvarying.

Let $x$ represent any number and contemplate the cumulative distribution function of $X,$ defined as

$$\begin{aligned} F_X(x) &= \Pr(X\le x) \\ &= \Pr(H \le x\mid \mathcal H)\Pr(\mathcal H) + \Pr(C \le x\mid \mathcal C)\Pr(\mathcal C) \\ &= F_H(x)(1-p) + F_C(x)p.\\ \end{aligned}$$

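These equalities use only basic properties of probabilities: the second is the law of total probability, conditioning on which of the two events occurred, and the third substitutes the definitions of $F_H,$ $F_C,$ and $p.$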

This convex combination of CDFs is called a mixture distribution with weights $1-p$ and $p.$

When the variables $H$ and $C$ have densities (pdfs), the density of the mixture is the same linear combination of the densities: that's a direct application of the sum rule of differentiation. In your notation, $f_X(x) = (1-p)f_H(x) + pf_C(x).$
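
As a rough numerical check of the two formulas above, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration, not something from the question: the normal distributions standing in for $C$ and $H$, the value $p = 0.3$, and the sample size. It merges two simulated samples whose sizes are in the ratio $p : (1-p)$ and compares the empirical CDF of the merged collection with $(1-p)F_H + pF_C.$

```python
import random
from statistics import NormalDist

# Hypothetical stand-ins for the travel-time distributions (in minutes).
dist_C = NormalDist(mu=20.0, sigma=4.0)    # cars
dist_H = NormalDist(mu=90.0, sigma=15.0)   # people on foot
p = 0.3  # assumed probability that a measurement comes from a car


def mixture_cdf(x):
    """F_X(x) = (1 - p) F_H(x) + p F_C(x)."""
    return (1 - p) * dist_H.cdf(x) + p * dist_C.cdf(x)


def mixture_pdf(x):
    """f_X(x) = (1 - p) f_H(x) + p f_C(x)."""
    return (1 - p) * dist_H.pdf(x) + p * dist_C.pdf(x)


# Simulate the "merge the two collections" procedure from the question.
rng = random.Random(0)
n = 20_000
n_C = round(p * n)  # the share of car measurements plays the role of p
merged = [rng.gauss(20.0, 4.0) for _ in range(n_C)]
merged += [rng.gauss(90.0, 15.0) for _ in range(n - n_C)]


def empirical_cdf(x):
    """Fraction of merged measurements that are <= x."""
    return sum(t <= x for t in merged) / len(merged)


for x in (15.0, 30.0, 60.0, 90.0, 120.0):
    print(f"x={x:6.1f}  empirical={empirical_cdf(x):.3f}  mixture={mixture_cdf(x):.3f}")
```

In this sketch the weight $p$ is simply the fraction of car measurements in the merged collection, so the shape of the merged histogram depends on how many cars and people were timed, not only on $f_C$ and $f_H.$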

What is $g$?

You ask to express $X$ as $X = g(H,C).$ That appears to be a form of weighted coproduct of random variables. I will sketch the construction. It generalizes a construction that has been called a "coproduct" where $p$ is limited to $1/2.$

The formal definitions tell us the random variables are functions $H:(\Omega_H,\mathfrak F_H, \mathbb P_H)\to \mathbb R$ and $C:(\Omega_C,\mathfrak F_C, \mathbb P_C)\to \mathbb R,$ possibly on two distinct probability spaces. Given these, define

$$\Omega = \{(\eta, 0)\mid \eta\in\Omega_H\}\cup \{(0,\gamma)\mid \gamma\in\Omega_C\}$$

(the set coproduct of the two sample spaces), push the sigma-algebras forward into $\Omega$ via the canonical embeddings $\Omega_H\to \Omega$ and $\Omega_C\to\Omega,$ generate a sigma-algebra $\mathfrak F$ from them, and, for a specified $0\lt p\lt 1,$ define a probability measure on that sigma-algebra via

$$\mathbb P(\mathcal H\times \{0\}) = (1-p)\mathbb P_H(\mathcal H)$$

for all $\mathcal H \in\mathfrak F_H$ and

$$\mathbb P(\{0\}\times \mathcal C) = p\mathbb P_C(\mathcal C)$$

for all $\mathcal C \in\mathfrak F_C.$ The random variable $X$ can then be defined as

$$X((\eta, 0)) = H(\eta);\quad X((0,\gamma)) = C(\gamma)$$

for all $\eta\in\Omega_H$ and $\gamma\in\Omega_C.$ It is an elementary exercise in applying definitions to verify that this is well-defined and $X$ is indeed a random variable whose distribution is the intended mixture of the distributions of $H$ and $C.$
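
The construction can be mimicked in code. The following Python sketch rests on several illustrative assumptions: both sample spaces are taken to be $[0,1)$ with the uniform measure, $H$ and $C$ are hypothetical exponential travel times with means 90 and 20, $p = 0.3,$ and a tag `"H"` or `"C"` plays the role of the zero placeholder that marks which copy of the disjoint union a point belongs to.

```python
import math
import random

p = 0.3  # assumed weight of the car component


# Hypothetical random variables on Omega_H = Omega_C = [0, 1), uniform measure.
def H(eta):
    """Walking time: exponential with mean 90 (inverse-CDF transform)."""
    return -90.0 * math.log(1.0 - eta)


def C(gamma):
    """Driving time: exponential with mean 20 (inverse-CDF transform)."""
    return -20.0 * math.log(1.0 - gamma)


def draw_omega(rng):
    """Draw a point of the disjoint union Omega.

    With probability p the point lies in the copy of Omega_C,
    otherwise in the copy of Omega_H; this realizes the measure P.
    """
    if rng.random() < p:
        return ("C", rng.random())
    return ("H", rng.random())


def X(omega):
    """X restricted to each copy is the original random variable."""
    tag, point = omega
    return H(point) if tag == "H" else C(point)


rng = random.Random(1)
omegas = [draw_omega(rng) for _ in range(50_000)]
sample = [X(w) for w in omegas]

share_C = sum(tag == "C" for tag, _ in omegas) / len(omegas)
sample_mean = sum(sample) / len(sample)
mixture_mean = (1 - p) * 90.0 + p * 20.0  # mean of the mixture

print(f"share of car points: {share_C:.3f} (p = {p})")
print(f"sample mean: {sample_mean:.2f}, mixture mean: {mixture_mean:.2f}")
```

Note that no single draw ever evaluates both $H$ and $C$: each sample point lives in exactly one of the two copies, which is what distinguishes this from a function of a jointly observed pair $(H, C).$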

A convenient notation to abbreviate this entire categorical construction would be something like

$$X = H\coprod_{(1-p,\ p)} C.$$

Remarks

For a fuller account of the definitions, see https://stats.stackexchange.com/a/149860/919.

For related calculations, including code to compute CDFs and quantile functions of mixtures, visit https://stats.stackexchange.com/a/411671/919.

To learn how to draw random variates from a mixture (with general-purpose code) see https://stats.stackexchange.com/a/64058/919.

You can conceive of any (non-constant) distribution as a mixture. The analysis at https://stats.stackexchange.com/a/299765/919 gives an interesting example of running this operation in reverse by dissecting a given distribution into a mixture of two other distributions.

Thanks for the thorough answer. I gave it an initial read and I think I understand now. I will go through again and become more familiar with all the concepts before accepting this answer. For now you totally deserve an upvote <3 — Andry, Jul 05 '23 at 20:59
Great answer! I'm a bit unsettled by the word "coproduct". In this case it's a linear combination; I'm used to the word "product" being used exclusively for bilinear combinations, not for linear combinations. Do you know a justification for this choice of vocabulary? — Stef, Jul 06 '23 at 07:48
@Stef See this article on coproducts in category theory. You could think of there being two simple ways to combine sets, the product and co-product; category theorists like to use the prefix "co-" to contrast pairs of things. If you want to learn more, I suggest Eugenia Cheng's book as a start. — Simon Crase, Jul 06 '23 at 08:07
@Simon Thank you. More specifically, the "co" refers to a systematic reversal of the arrows in categorical definitions and constructions. — whuber, Jul 06 '23 at 13:13