This question may seem very basic for this forum, but I am indeed a beginner. I am trying to understand the method of least squares for regression. The likelihood of the parameters is defined as $$\mathcal L(\vec\theta)\stackrel{\text{def}}{=}\prod_{i=1}^m p_Y(\vec y^{(i)}|\vec x^{(i)};\vec\theta)$$
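To make the definition concrete to myself, here is a minimal numerical sketch of $\mathcal L(\vec\theta)$ (my own illustration: it assumes a scalar linear hypothesis $\hat y_\theta^{(i)}=\theta x^{(i)}$, $\sigma=1$, and made-up data):

```python
import numpy as np
from scipy.stats import norm

# Made-up 1-D training data (m = 5 examples), purely for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1])

def likelihood(theta, sigma=1.0):
    """L(theta): product over i of the Gaussian density of y_i,
    centered at the prediction theta * x_i (linear hypothesis assumed)."""
    y_hat = theta * x
    return np.prod(norm.pdf(y, loc=y_hat, scale=sigma))
```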
It seems like the squared cost function $\mathcal J_\text{sq}(\vec\theta)$ is derived from $\mathcal L(\vec\theta)$ in the following steps: $$\begin{align}\ln\mathcal L(\vec\theta)&=\ln\prod_{i=1}^m p_Y(\vec y^{(i)}|\vec x^{(i)};\vec\theta)\\&=\ln\prod_{i=1}^m\frac1{\sigma\sqrt{2\pi}}\exp\left(-\frac12\left(\frac{\vec y^{(i)}-\hat y_\theta^{(i)}}{\sigma}\right)^2\right)\\&=m\ln\frac1{\sigma\sqrt{2\pi}}-\frac1{2\sigma^2}\sum_{i=1}^m(\vec y^{(i)}-\hat y_\theta^{(i)})^2\end{align}$$$$\begin{align}&\because f(\cdot)=-\frac{\sigma^2}m\left((\cdot)-m\ln\frac{1}{\sigma\sqrt{2\pi}}\right)\text{ is decreasing, }\\&\therefore\text{as }\mathcal J_\text{sq}(\vec\theta)=\frac1{2m}\sum_{i=1}^m(\vec y^{(i)}-\hat y_\theta^{(i)})^2\text{ decreases, }\mathcal L(\vec\theta)\text{ increases.}\end{align}$$ I understand every line of the derivation except the second. The second line implies that $$p_Y(\vec y^{(i)}|\vec x^{(i)};\vec\theta)=\frac1{\sigma\sqrt{2\pi}}\exp\left(-\frac12\left(\frac{\vec y^{(i)}-\hat y_\theta^{(i)}}{\sigma}\right)^2\right)\tag{*}$$ because of the normality assumption: $$\vec y^{(i)}|\vec x^{(i)};\vec\theta\sim\mathcal N(\hat y_\theta^{(i)},\sigma^2)$$ In (*), the LHS is the probability of the event "$Y=\vec y^{(i)}$ given $X=\vec x^{(i)}$", parametrized by $\vec\theta$ — that is, a probability for the continuous random variable $Y$ conditioned on $X$ — while the RHS is the probability density function (PDF) of $\vec y^{(i)}|\vec x^{(i)};\vec\theta$ evaluated at a point.
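Numerically the monotone relationship checks out: continuing the sketch above, the $\theta$ that minimizes $\mathcal J_\text{sq}$ on a grid is exactly the one that maximizes $\mathcal L$ (again my own illustration on the made-up data):

```python
def j_sq(theta):
    """J_sq(theta) = (1 / 2m) * sum of squared residuals."""
    return np.mean((y - theta * x) ** 2) / 2

thetas = np.linspace(0.0, 4.0, 401)                 # grid of candidate parameters
L_vals = np.array([likelihood(t) for t in thetas])
J_vals = np.array([j_sq(t) for t in thetas])

# Minimizing J_sq and maximizing L pick out the same theta on the grid.
assert np.argmax(L_vals) == np.argmin(J_vals)
```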
My confusion is this: in general, for a continuous random variable $Y$, $$P[Y\le y]=\int_{-\infty}^y f_Y(t)\,dt$$
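For instance, taking $Y$ to be standard normal (my assumption, just to illustrate the formula), integrating the PDF up to $y$ reproduces the CDF:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

y0 = 0.7
integral, _ = quad(norm.pdf, -np.inf, y0)   # numerically integrate the PDF up to y0
assert np.isclose(integral, norm.cdf(y0))   # agrees with P[Y <= y0]
```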
Isn't $Y$ continuous? How can we even speak of the probability of the event "$Y=\vec y^{(i)}$ given $X=\vec x^{(i)}$", parametrized by $\vec\theta$? Why isn't the RHS of (*) an integral?