
In classification one usually computes $$ C = \operatorname*{argmax}_k p(C=k\mid X) $$ where $p(C=k\mid X)$ is the posterior distribution.
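For concreteness, here is a minimal sketch of that decision rule in Python; the posterior values are made-up numbers, not the output of any fitted model:

```python
import numpy as np

# Hypothetical posterior probabilities p(C=k | X) for one observation,
# one entry per class k; made-up numbers that sum to 1.
posterior = np.array([0.2, 0.7, 0.1])

# The decision rule: pick the class with the largest posterior probability.
C_hat = np.argmax(posterior)
print(C_hat)  # -> 1
```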

In a simple logistic regression setting with $C \in \{0, 1\}$, the model is $$ p(C=1\mid X)=\frac{\exp(\beta_0+\beta_1 x_i)}{1+\exp(\beta_0+\beta_1 x_i)} $$ and therefore $$ p(C=0\mid X)=\frac{1}{1+\exp(\beta_0+\beta_1 x_i)}, $$ with $X=\{x_i\},\ i=1,\ldots,N$.
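As a sketch of these two formulas, with placeholder values for $\beta_0$, $\beta_1$, and the $x_i$ (none of these numbers come from the question):

```python
import numpy as np

def sigmoid(z):
    # exp(z) / (1 + exp(z)), rewritten as 1 / (1 + exp(-z)) for stability
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -1.0, 2.0        # placeholder parameter values
x = np.array([0.5, 1.2, -0.3])  # placeholder observations x_i

p1 = sigmoid(beta0 + beta1 * x)  # p(C=1 | x_i)
p0 = 1.0 - p1                    # p(C=0 | x_i); p0 + p1 == 1
```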

We estimate the parameters $\beta_0, \beta_1$ via maximum likelihood estimation. To do so, one has to compute the product of the likelihoods of all $N$ observations. So far, so normal. However, in all textbooks the authors plug in the posterior instead of the likelihood (e.g., Bishop, p. 206; Hastie et al., p. 120): \begin{align} \ell(\beta) &= \log\left(\prod_{i=1}^N p(C_i=k\mid x_i, \beta)\right) \\[8pt] &= \log\left(\prod_{i=1}^N p(C_i=1\mid x_i, \beta)^{C_i}(1-p(C_i=1\mid x_i, \beta))^{1-C_i}\right) \end{align} And even though these probabilities are now conditioned on $\beta$ as well, they are still the posterior $p(C=k\mid X)$, not a likelihood.
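A minimal sketch of that estimation step, assuming Python with NumPy/SciPy: the sum above written as a negative log-likelihood and maximized numerically on simulated data (`scipy.optimize.minimize` is just one of many ways to do the optimization):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate N observations from the model with "true" beta = (-1, 2).
N = 500
x = rng.normal(size=N)
C = rng.binomial(1, 1.0 / (1.0 + np.exp(1.0 - 2.0 * x)))

def neg_log_likelihood(beta):
    # -l(beta) = -sum_i [ C_i log p_i + (1 - C_i) log(1 - p_i) ]
    p1 = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x)))
    return -np.sum(C * np.log(p1) + (1 - C) * np.log(1.0 - p1))

# Maximizing l(beta) is the same as minimizing -l(beta).
result = minimize(neg_log_likelihood, x0=np.zeros(2))
print(result.x)  # estimates of (beta_0, beta_1), close to (-1, 2)
```

So: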

  1. How come, for the MLE, we just plug in the posterior conditioned on the parameter $\beta$?

  2. Why is $p(C=k\mid X)$ a posterior anyway? To me, a posterior is a distribution over a parameter given the observed data. But the class $C$ is, to me, not a parameter but a target, just like the observations $y_i$ in a linear regression setting.

guest1
  • In the above-mentioned textbooks (Pattern Recognition and Machine Learning by Bishop and The Elements of Statistical Learning by Hastie et al.) they explicitly call $p(C=k|X)$ the posterior distribution on several occasions. – guest1 Nov 19 '18 at 20:55
  • It can be, but it's not in what you write. You write $p(C=k|X) = l(X, \beta)$. Where's the prior? Write $p(C|X) = l(X, \beta)\,p(C)$ and make life more interesting, or Bayesian at least. – AdamO Nov 19 '18 at 21:45
  • Well, in logistic regression we learn the posterior directly, since it is a discriminative algorithm. A prior would only be used in a generative algorithm like LDA, where we learn the likelihood and the prior. To me, if $p(C=k|X)$ is the posterior, then the likelihood should look like $p(X|C)$ or similar. – guest1 Nov 19 '18 at 21:58
  • No. Logistic regression is maximum likelihood. A prior is used in any Bayesian analysis: no prior, not Bayesian. A likelihood is not a probability function; it has to be scaled by a prior. – AdamO Nov 19 '18 at 22:00
  • I never said it was a Bayesian analysis. All I was saying is that the textbooks call this term a posterior probability but then use it in the maximum likelihood estimation as a likelihood function, which seems to be a contradiction. I am aware that LR uses MLE to fit the model, but the selection of the class $C_i$ of observation $x_i$ is then again done by choosing the class with the maximum posterior given $x_i$. – guest1 Nov 20 '18 at 06:18