
In classification one usually computes $$ C = \operatorname*{argmax}_k p(C=k\mid X) $$ where $p(C=k\mid X)$ is the posterior distribution.
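For concreteness, here is a minimal sketch of that decision rule in Python; the posterior values are made-up numbers, not the output of any fitted model:

```python
import numpy as np

# Hypothetical posterior probabilities p(C=k | X) for one observation,
# one entry per class k; made-up numbers that sum to 1.
posterior = np.array([0.2, 0.7, 0.1])

# The decision rule: pick the class with the largest posterior probability.
C_hat = np.argmax(posterior)
print(C_hat)  # -> 1
```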

In a simple logistic regression setting with $C \in \{0, 1\}$, the model is $$ p(C=1\mid X)=\frac{\exp(\beta_0+\beta_1 x_i)}{1+\exp(\beta_0+\beta_1 x_i)} $$ and therefore $$ p(C=0\mid X)=\frac{1}{1+\exp(\beta_0+\beta_1 x_i)}, $$ with $X=\{x_i\},\ i=1,\ldots,N$.
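As a sketch of these two formulas, with placeholder values for $\beta_0$, $\beta_1$, and the $x_i$ (none of these numbers come from the question):

```python
import numpy as np

def sigmoid(z):
    # exp(z) / (1 + exp(z)), rewritten as 1 / (1 + exp(-z)) for stability
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -1.0, 2.0        # placeholder parameter values
x = np.array([0.5, 1.2, -0.3])  # placeholder observations x_i

p1 = sigmoid(beta0 + beta1 * x)  # p(C=1 | x_i)
p0 = 1.0 - p1                    # p(C=0 | x_i); p0 + p1 == 1
```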

We estimate the parameters $\beta_0, \beta_1$ via maximum likelihood estimation. To do so, one has to compute the product of the likelihoods of all $N$ observations. So far, so normal. However, in all textbooks the authors plug in the posterior instead of the likelihood (e.g., Bishop, p. 206; Hastie et al., p. 120): \begin{align} \ell(\beta) &= \log\left(\prod_{i=1}^N p(C_i=k\mid x_i, \beta)\right) \\[8pt] &= \log\left(\prod_{i=1}^N p(C_i=1\mid x_i, \beta)^{C_i}(1-p(C_i=1\mid x_i, \beta))^{1-C_i}\right) \end{align} And even though these probabilities are now conditioned on $\beta$ as well, they are still the posterior $p(C=k\mid X)$, not a likelihood.
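A minimal sketch of that estimation step, assuming Python with NumPy/SciPy: the sum above written as a negative log-likelihood and maximized numerically on simulated data (`scipy.optimize.minimize` is just one of many ways to do the optimization):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate N observations from the model with "true" beta = (-1, 2).
N = 500
x = rng.normal(size=N)
C = rng.binomial(1, 1.0 / (1.0 + np.exp(1.0 - 2.0 * x)))

def neg_log_likelihood(beta):
    # -l(beta) = -sum_i [ C_i log p_i + (1 - C_i) log(1 - p_i) ]
    p1 = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x)))
    return -np.sum(C * np.log(p1) + (1 - C) * np.log(1.0 - p1))

# Maximizing l(beta) is the same as minimizing -l(beta).
result = minimize(neg_log_likelihood, x0=np.zeros(2))
print(result.x)  # estimates of (beta_0, beta_1), close to (-1, 2)
```

So: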

  1. How come, for the MLE, we just plug in the posterior conditioned on the parameter $\beta$?

  2. Why is $p(C=k\mid X)$ a posterior anyway? To me, a posterior is a distribution over a parameter given the observed data. But the class $C$ is, to me, not a parameter but a target, just like the observations $y_i$ in a linear regression setting.

guest1
  • In the above-mentioned textbooks (Pattern Recognition and Machine Learning by Bishop and The Elements of Statistical Learning by Hastie et al.) they explicitly call $p(C=k|X)$ the posterior distribution on several occasions. – guest1 Nov 19 '18 at 20:55
  • It can be, but it's not in what you write. You write $p(C=k|X) = l(X, \beta)$. Where's the prior? Write $p(C|X) = l(X, \beta)\,p(C)$ and make life more interesting, or Bayesian at least. – AdamO Nov 19 '18 at 21:45
  • Well, in logistic regression we learn the posterior directly, since it is a discriminative algorithm. A prior would only be used in a generative algorithm like LDA, where we learn the likelihood and the prior. To me, if $p(C=k|X)$ is the posterior, then the likelihood should look like $p(X|C)$ or similar. – guest1 Nov 19 '18 at 21:58
  • No. Logistic regression is maximum likelihood. A prior is used in any Bayesian analysis: no prior, not Bayesian. A likelihood is not a probability function; it has to be scaled by a prior. – AdamO Nov 19 '18 at 22:00
  • I never said it was a Bayesian analysis. All I was saying is that the textbooks call this term a posterior probability but then use it in the maximum likelihood estimation as a likelihood function, which seems to be a contradiction. I am aware that LR uses MLE to fit the model, but the selection of the class $C_i$ of observation $x_i$ is then again done by choosing the class with the maximum posterior given $x_i$. – guest1 Nov 20 '18 at 06:18