
This is with regard to the example given in the Wikipedia article on statistical models. In the example it is claimed that "Each possible value of $\theta = (b_0, b_1, \sigma^2)$ determines a distribution on S," but no further explanation is given. I understand that $(b_0, b_1, \sigma^2)$ gives us a height probability distribution for any given age, but I do not understand how this can be extended to a probability distribution on the space of all (age, height) pairs.

The most straightforward idea is to consider y-axis-aligned bell curves tracking along the line $y = b_0 x + b_1$, but the resulting surface of course does not integrate to 1 and so does not constitute a probability distribution on S.
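Explicitly: for each fixed $x$ the bell curve in $y$ integrates to 1, so integrating such a surface over all of $\mathbb{R}^2$ diverges instead of giving 1,

$$\int_{\mathbb{R}}\int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\bigl(y-(b_0 x+b_1)\bigr)^2}{2\sigma^2}\right) dy\,dx = \int_{\mathbb{R}} 1 \, dx = \infty.$$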

Edit for clarification: A probability distribution on $\mathbb{R}^2$ is given by a nonnegative function $p$ such that $\int_{\mathbb{R}^2} p(x,y)\,dx\,dy = 1$. Given a linear model $y = b_0 x + b_1 + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$, how does a fixed $(b_0, b_1, \sigma^2)$ induce such a function, as claimed in the linked article?

John Doe
  • You are correct that with the limited information you supply, a distribution on $\mathbb R^2$ is not defined. However, you overlook the condition stated in the Wikipedia article: "with the ages of the children distributed uniformly." Although incomplete ("uniformly" on what set of ages, exactly?), it's clear: a distribution on $X$ and a conditional distribution for $Y$ given $X$ always determine a distribution on all $(X,Y)$ pairs. – whuber Mar 26 '24 at 13:31
  • Thanks whuber, that clears it up for me. – John Doe Mar 26 '24 at 19:06

3 Answers


You might want to adopt Bayesian notation to make this more intuitive. Given a fixed set of parameters $\theta$ and your $X$ coordinates, you identify a distribution:

$$ Y \mid X, \theta \sim \mathcal{N}(b_0 + b_1 X,\ \sigma^2) $$

In this case it is normal because you assume that the errors $\epsilon$ are normally distributed, and adding a constant to a Gaussian random variable just shifts its mean.
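As a minimal numerical sketch (assuming, as in the Wikipedia example, that the ages are distributed uniformly; the interval and parameter values below are made up for illustration), the joint density $p(x,y) = p(y \mid x)\,p(x)$ built from this Normal conditional and a uniform marginal integrates to 1 over $\mathbb{R}^2$:

```python
from scipy import stats
from scipy.integrate import dblquad

# Hypothetical values, for illustration only.
b0, b1, sigma = 100.0, 6.0, 5.0   # intercept, slope, error standard deviation
x_lo, x_hi = 2.0, 10.0            # assumed uniform range of ages

def joint_density(y, x):
    """p(x, y) = p(y | x) * p(x): Normal conditional times uniform marginal."""
    p_x = stats.uniform.pdf(x, loc=x_lo, scale=x_hi - x_lo)
    p_y_given_x = stats.norm.pdf(y, loc=b0 + b1 * x, scale=sigma)
    return p_y_given_x * p_x

# Integrate over a y-range wide enough to hold essentially all of the mass.
total, _ = dblquad(joint_density, x_lo, x_hi, lambda x: -1e3, lambda x: 1e3)
print(total)  # approximately 1.0
```

Only the conditional part is pinned down by $(b_0, b_1, \sigma^2)$; any marginal for the ages (here uniform, as in the article) completes it to a distribution on $\mathbb{R}^2$.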

xcesc
  • Maybe I was unclear, but this does not address my question. I've edited the question, hopefully it's more clear now what I'm asking. – John Doe Mar 26 '24 at 09:58
  • Assuming you have a marginal distribution of X, you can simply use Bayes' formula and what I've already written above to find that, for a fixed $\theta$, $p(x,y)=p(y\mid x)\,p(x)$, which identifies your distribution. – xcesc Mar 26 '24 at 10:05

The answers of whuber and xcesc are very clear; my mathematical skills are more limited. But isn't your question similar to the following simple example? 10% of all men like shopping and 90% do not; for women it is the other way around. In total it is 50/50. For men and women separately the two probabilities sum to 1, and so does the total: the overall probabilities sum to 1, not to 2, of course.
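Spelling this out (assuming equal numbers of men and women, which is what makes the total come out 50/50): each joint probability is a group's conditional probability times that group's marginal weight, and together they sum to 1:

$$\underbrace{0.5\cdot 0.1}_{\text{man, likes}}+\underbrace{0.5\cdot 0.9}_{\text{man, dislikes}}+\underbrace{0.5\cdot 0.9}_{\text{woman, likes}}+\underbrace{0.5\cdot 0.1}_{\text{woman, dislikes}}=0.05+0.45+0.45+0.05=1.$$

The marginal is $P(\text{likes}) = 0.05 + 0.45 = 0.5$, which is the "50/50".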

BenP

To add to @xcesc's answer, the act of modeling $Y = f_\theta(X,\cdot)$ is essentially a way to approximate

$$P(X,Y) \approx P_\theta(Y\mid X)\cdot P(X)$$

This is an approximation rather than an equality because you cannot be sure that your specified $P_\theta$ family matches the true $P(Y\mid X)$ for some $\theta^\star$.

In all such attempts you typically do not wish to model the marginal $P(X)$; you simply use the empirical distribution, $P(X) \approx \widehat{P}(X)$.

So in such modelling tasks we say that "$P_\theta$ induces a distribution on $S$ (here $\mathbb{R}^2$)" while leaving the obvious part unstated, namely the assumption that the input data come from $P(X)$.

Then it is also true that $\theta_0 = (b_0, b_1)$, with the noise term dropped, implies a distribution on $\mathbb{R}^2$ through the same logic; however, that distribution is not consistent with the data, because there are data points $(x_i,y_i)$ for which $(x_i,y_i)\not\in\mathrm{Supp}\left(P_{\theta_0}(Y\mid X)\,P(X)\right)$: without noise, the conditional puts all of its mass on the line itself.

The reason we wish to go beyond the joint empirical distribution $\widehat{P}(X,Y)$ (which is consistent with the data) is that in many settings, after observing finitely many pairs $(x_i,y_i)$, $1\leq i\leq N$, you need to come up with a guess for $(x_{N+1}, y_{N+1})$ where $x_{N+1} \neq x_i$ for all $x_i$ in the dataset.

In this setting, simply "registering" the frequencies of the observed $(x,y)$ pairs in a table often yields lower predictive power than fitting a simple model of $Y\mid X=x$ that is defined for unseen $x$ values.
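As a rough sketch of this last point (toy data, made-up parameter values, and ordinary least squares standing in for whatever fitting procedure one prefers): the empirical table has no entry for an unseen $x_{N+1}$, while the fitted $\theta$ still supplies a full conditional distribution for $Y$ there.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy data from a hypothetical "true" line (illustration only).
x = rng.uniform(2, 10, size=50)
y = 100 + 6 * x + rng.normal(0, 5, size=50)

# Fit theta = (b0, b1, sigma^2) by ordinary least squares.
b1_hat, b0_hat = np.polyfit(x, y, 1)              # slope, intercept
sigma_hat = np.std(y - (b0_hat + b1_hat * x), ddof=2)

# An x value that does not appear in the data: the empirical table is silent here.
x_new = 7.123
print(np.any(np.isclose(x, x_new)))               # False: no matching "row" to look up

# The fitted model still gives a distribution for Y | X = x_new.
y_given_xnew = stats.norm(loc=b0_hat + b1_hat * x_new, scale=sigma_hat)
print(y_given_xnew.mean(), y_given_xnew.interval(0.95))
```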

NoVariation