
Let the dataset be $D = \{(x_1,y_1),...,(x_n,y_n)\}$ where $x_i\in\mathbb{R}^d$ and $y_i\in\{0,1\}$.

There are two mean vectors $\mu_0,\mu_1\in\mathbb{R}^d$, one per class: $\mu_0$ is the mean of the feature vectors with label $y=0$, and $\mu_1$ the mean of those with $y=1$.

The log-likelihood is $$ \log\prod_{i=1}^np(x_i|y_i)p(y_i) $$ where $$ p(x|y=0)=\frac{1}{(2\pi)^{d/2}|\Gamma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_0)^T\Gamma^{-1}(x-\mu_0)\right) $$ $$ p(x|y=1)=\frac{1}{(2\pi)^{d/2}|\Gamma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_1)^T\Gamma^{-1}(x-\mu_1)\right) $$ $$ p(y)=\phi^y(1-\phi)^{1-y} $$
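For concreteness, here is a minimal sketch of this joint log-likelihood in Python (my addition, not part of the question; the function name `joint_log_likelihood` is hypothetical, and it assumes `numpy` and `scipy` are available, with `X` an $n\times d$ array and `y` a binary vector):

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_log_likelihood(X, y, mu0, mu1, Gamma, phi):
    """log prod_i p(x_i | y_i) p(y_i) for the shared-covariance Gaussian model."""
    ll = 0.0
    for x_i, y_i in zip(X, y):
        mu = mu1 if y_i == 1 else mu0                              # class-conditional mean
        ll += multivariate_normal.logpdf(x_i, mean=mu, cov=Gamma)  # log p(x_i | y_i)
        ll += y_i * np.log(phi) + (1 - y_i) * np.log(1 - phi)      # log p(y_i)
    return ll
```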

To find the maximum likelihood estimator of $\phi$, we take the derivative of the log-likelihood with respect to $\phi$ and set it to $0$.

First rewrite the log-likelihood: $$ \sum_{i=1}^n\log(p(x_i|y_i)p(y_i)) = \sum_{i=1}^n\left[\log p(x_i|y_i) + \log p(y_i)\right] $$ Since $p(x_i|y_i)$ does not depend on $\phi$, only the $\log p(y_i)$ terms survive differentiation. Taking the derivative with respect to $\phi$ and setting it to $0$: $$ \sum_{i=1}^n\frac{y_i\phi^{y_i-1}(1-\phi)^{1-y_i}-\phi^{y_i}(1-y_i)(1-\phi)^{-y_i}}{\phi^{y_i}(1-\phi)^{1-y_i}} = 0 $$ There are two cases, $y_i=1$ and $y_i=0$.

When $y_i=1$, the summand is $\frac{1}{\phi}$.

When $y_i=0$, the summand is $-\frac{1}{1-\phi}$.
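As a quick symbolic sanity check on these two cases (my own sketch, assuming `sympy` is available):

```python
import sympy as sp

phi = sp.symbols('phi', positive=True)
# The phi-dependent term of the log-likelihood: log p(y) = y*log(phi) + (1-y)*log(1-phi)
for y in (1, 0):
    log_p_y = y * sp.log(phi) + (1 - y) * sp.log(1 - phi)
    print(y, sp.diff(log_p_y, phi))  # 1/phi for y=1, -1/(1 - phi) for y=0
```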

Ultimately, this needs to end up in the form $$ \phi=\frac{1}{n}\sum_{i=1}^n\mathbb{1}\{y_i=1\} $$ If, for example, I had $$ y=\begin{bmatrix}1\\1\\0\end{bmatrix} $$ then the equation becomes $$ \frac{1}{\phi}+\frac{1}{\phi}-\frac{1}{1-\phi}=0 $$ $$ \phi=\frac{2}{3} $$ which is correct and matches what the indicator function gives, but I can't see how to formally write the general solution in terms of the indicator function.
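A numeric check of this example (again my own sketch, not part of the derivation): maximizing the $\phi$-dependent part of the log-likelihood over a grid recovers $2/3$.

```python
import numpy as np

y = np.array([1, 1, 0])
grid = np.linspace(0.01, 0.99, 9801)
ll = [np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) for p in grid]
print(grid[np.argmax(ll)])  # ~0.6667, i.e. (1/n) * number of y_i equal to 1
```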

jroc
  • Wait, I'm confused. $\mu_y$ is a scalar, right? But it appears that $x$ is a vector, due to the transpose in $(x - \mu_y)^T$. How are you adding and subtracting scalars with vectors? Can you [edit] to clarify your notation? What do your symbols mean? – Sycorax Mar 14 '23 at 17:37
  • How is $\mu_y$ related to $y$? What is the dimension of $y$? How is $y_i$ different from $y$ and $\phi_i$ different from $\phi$? The expression for $p(x|y)$ contains $\mu_y$ but your text describes two mean vectors $\mu_0$ and $\mu_1$. – Sycorax Mar 14 '23 at 18:26
  • sorry, edited to be more explicit – jroc Mar 14 '23 at 18:37
  • If $p(x | y)$ doesn't depend on $\phi$, then $p(x|y)$ is constant wrt $\phi$, so $\frac{d}{d\phi}\log p(x|y)=0$ & the task reduces to estimating the parameter of a Bernoulli distribution. This walks through the steps: https://stats.stackexchange.com/questions/149202/maximum-likelihood-estimation-of-p-in-a-binomial-sample – Sycorax Mar 14 '23 at 18:43

1 Answer


$L(\phi) = \prod_j p(x_j, y_j; \phi)$, so $\log L = \ell(\phi) = \sum_j \log p(x_j, y_j; \phi)$.

Since $p(x\mid y)$ does not depend on $\phi$, $p(x, y) = p(x \mid y)p(y) \propto p(y)$ as a function of $\phi$, so $\frac{d}{d\phi} \log p(y) = \frac{y}{\phi} -\frac{1-y}{1-\phi}$ gives $$ \phi(1-\phi)\frac{d}{d\phi}\ell(\phi) =\sum_j \left[y_j(1-\phi) - (1-y_j)\phi\right] = \sum_j (y_j - \phi) = \sum_j y_j - n\phi = 0 $$ and $\phi = \frac{1}{n}\sum_j y_j$.
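To connect this to the indicator form in the question: since each $y_j\in\{0,1\}$, we have $y_j = \mathbb{1}\{y_j=1\}$, so $$ \phi = \frac{1}{n}\sum_j y_j = \frac{1}{n}\sum_j \mathbb{1}\{y_j=1\}. $$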

Hunaphu