
This is with regard to the example given in the Wikipedia article on statistical models. In the example it is claimed that "Each possible value of $\theta = (b_0, b_1, \sigma^2)$ determines a distribution on S," but no further explanation is given. I understand that $(b_0, b_1, \sigma^2)$ gives us a height probability distribution for any given age, but I do not understand how this can be extended to a probability distribution on the space of all (age, height) pairs.

The most straightforward idea is to consider y-axis-aligned bell curves tracking along the line $y = b_0 x + b_1$, but the resulting surface of course does not integrate to 1 and so does not constitute a probability distribution on S.
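Explicitly: for each fixed $x$ the bell curve in $y$ integrates to 1, so integrating such a surface over all of $\mathbb{R}^2$ diverges instead of giving 1,

$$\int_{\mathbb{R}}\int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\bigl(y-(b_0 x+b_1)\bigr)^2}{2\sigma^2}\right) dy\,dx = \int_{\mathbb{R}} 1 \, dx = \infty.$$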

Edit for clarification: A probability distribution on $\mathbb{R}^2$ is given by a nonnegative function $p$ such that $\int_{\mathbb{R}^2} p(x,y)\,dx\,dy = 1$. Given a linear model $y = b_0 x + b_1 + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$, how does a fixed $(b_0, b_1, \sigma^2)$ induce such a function, as claimed in the linked article?

John Doe
  • You are correct that with the limited information you supply, a distribution on $\mathbb R^2$ is not defined. However, you overlook the condition stated in the Wikipedia article: "with the ages of the children distributed uniformly." Although incomplete ("uniformly" on what set of ages, exactly?), it's clear: a distribution on $X$ and a conditional distribution for $Y$ given $X$ always determine a distribution on all $(X,Y)$ pairs. – whuber Mar 26 '24 at 13:31
  • Thanks whuber, that clears it up for me. – John Doe Mar 26 '24 at 19:06

3 Answers


You might want to adopt Bayesian notation to make this more intuitive. Given a fixed set of parameters $\theta$ and your $X$ coordinates, you identify a distribution:

$$ Y \mid X, \theta \sim \mathcal{N}(b_0 + b_1 X,\ \sigma^2) $$

In this case it is normal because you assume that the errors $\epsilon$ are normally distributed, and adding a constant to a Gaussian random variable just shifts its mean.
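As a minimal numerical sketch (assuming, as in the Wikipedia example, that the ages are distributed uniformly; the interval and parameter values below are made up for illustration), the joint density $p(x,y) = p(y \mid x)\,p(x)$ built from this Normal conditional and a uniform marginal integrates to 1 over $\mathbb{R}^2$:

```python
from scipy import stats
from scipy.integrate import dblquad

# Hypothetical values, for illustration only.
b0, b1, sigma = 100.0, 6.0, 5.0   # intercept, slope, error standard deviation
x_lo, x_hi = 2.0, 10.0            # assumed uniform range of ages

def joint_density(y, x):
    """p(x, y) = p(y | x) * p(x): Normal conditional times uniform marginal."""
    p_x = stats.uniform.pdf(x, loc=x_lo, scale=x_hi - x_lo)
    p_y_given_x = stats.norm.pdf(y, loc=b0 + b1 * x, scale=sigma)
    return p_y_given_x * p_x

# Integrate over a y-range wide enough to hold essentially all of the mass.
total, _ = dblquad(joint_density, x_lo, x_hi, lambda x: -1e3, lambda x: 1e3)
print(total)  # approximately 1.0
```

Only the conditional part is pinned down by $(b_0, b_1, \sigma^2)$; any marginal for the ages (here uniform, as in the article) completes it to a distribution on $\mathbb{R}^2$.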

xcesc
  • Maybe I was unclear, but this does not address my question. I've edited the question, hopefully it's more clear now what I'm asking. – John Doe Mar 26 '24 at 09:58
  • Assuming you have a marginal distribution of X, you can simply use Bayes' formula and what I've already written above to find that, for a fixed $\theta$, $p(x,y)=p(y\mid x)\,p(x)$, which identifies your distribution. – xcesc Mar 26 '24 at 10:05

The answers of whuber and xcesc are very clear; my mathematical skills are more limited. But isn't your question similar to the following simple example? 10% of all men like shopping and 90% do not; for women it is the other way around. In total it is 50/50. For men and women separately the two probabilities sum to 1, and so does the total: the overall probabilities sum to 1, not to 2, of course.
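Spelling this out (assuming equal numbers of men and women, which is what makes the total come out 50/50): each joint probability is a group's conditional probability times that group's marginal weight, and together they sum to 1:

$$\underbrace{0.5\cdot 0.1}_{\text{man, likes}}+\underbrace{0.5\cdot 0.9}_{\text{man, dislikes}}+\underbrace{0.5\cdot 0.9}_{\text{woman, likes}}+\underbrace{0.5\cdot 0.1}_{\text{woman, dislikes}}=0.05+0.45+0.45+0.05=1.$$

The marginal is $P(\text{likes}) = 0.05 + 0.45 = 0.5$, which is the "50/50".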

BenP

To add to @xcesc's answer, the act of modeling $Y = f_\theta(X,\cdot)$ is essentially a way to approximate

$$P(X,Y) \approx P_\theta(Y\mid X)\cdot P(X)$$

This is an approximation rather than an equality because you cannot be sure that your specified $P_\theta$ family matches the true $P(Y\mid X)$ for some $\theta^\star$.

In all such attempts you typically do not wish to model the marginal $P(X)$; you simply use the empirical distribution, $P(X) \approx \widehat{P}(X)$.

So in such modelling tasks we say that "$P_\theta$ induces a distribution on $S$ (here $\mathbb{R}^2$)" while leaving the obvious part unstated, namely the assumption that the input data come from $P(X)$.

Then it is also true that $\theta_0 = (b_0, b_1)$, with the noise term dropped, implies a distribution on $\mathbb{R}^2$ through the same logic; however, that distribution is not consistent with the data, because there are data points $(x_i,y_i)$ for which $(x_i,y_i)\not\in\mathrm{Supp}\left(P_{\theta_0}(Y\mid X)\,P(X)\right)$: without noise, the conditional puts all of its mass on the line itself.

The reason we wish to go beyond the joint empirical distribution $\widehat{P}(X,Y)$ (which is consistent with the data) is that in many settings, after observing finitely many pairs $(x_i,y_i)$, $1\leq i\leq N$, you need to come up with a guess for $(x_{N+1}, y_{N+1})$ where $x_{N+1} \neq x_i$ for all $x_i$ in the dataset.

In this setting, simply "registering" the frequencies of the observed $(x,y)$ pairs in a table often yields lower predictive power than fitting a simple model of $Y\mid X=x$ that is defined for unseen $x$ values.
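As a rough sketch of this last point (toy data, made-up parameter values, and ordinary least squares standing in for whatever fitting procedure one prefers): the empirical table has no entry for an unseen $x_{N+1}$, while the fitted $\theta$ still supplies a full conditional distribution for $Y$ there.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy data from a hypothetical "true" line (illustration only).
x = rng.uniform(2, 10, size=50)
y = 100 + 6 * x + rng.normal(0, 5, size=50)

# Fit theta = (b0, b1, sigma^2) by ordinary least squares.
b1_hat, b0_hat = np.polyfit(x, y, 1)              # slope, intercept
sigma_hat = np.std(y - (b0_hat + b1_hat * x), ddof=2)

# An x value that does not appear in the data: the empirical table is silent here.
x_new = 7.123
print(np.any(np.isclose(x, x_new)))               # False: no matching "row" to look up

# The fitted model still gives a distribution for Y | X = x_new.
y_given_xnew = stats.norm(loc=b0_hat + b1_hat * x_new, scale=sigma_hat)
print(y_given_xnew.mean(), y_given_xnew.interval(0.95))
```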

NoVariation