Assume a linear regression model $y = X \theta^{*} + \epsilon$, where $X$ is the feature matrix and $\theta^{*}$ is the true parameter vector.
Here I assume heteroskedastic errors, $\epsilon \sim N(0, \Sigma^{*})$.
I also assume that $\Sigma^{*}$ is diagonal.
So far I got that the log-likelihood function can be simplified to something like this (when considering the argmax with respect to $\theta$): $$ LL = \sum_{i=1}^n \left( - \frac{1}{2}\ln(2\pi\sigma_{i}^{2}) - \frac{(y_i- x_i^T \theta)^2}{2\sigma_{i}^2} \right)$$
... where $x_i$ represents the feature vector of observation $i$.
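For reference, I believe the same log-likelihood in matrix form (which should also cover the non-diagonal case) is
$$LL(\theta, \Sigma) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma| - \frac{1}{2}(y - X\theta)^T \Sigma^{-1} (y - X\theta),$$
with the per-observation form above recovered when $\Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2)$.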
But if I try to take the derivative with respect to $\theta$ and set it to 0, I'm not able to come up with a closed-form expression for the MLE $\hat{\theta}$, since I can't seem to separate out the $\sigma_{i}$'s.
Am I missing something obvious, or is a closed form even possible? Also, how much would things change if I did not assume $\Sigma^{*}$ to be diagonal?
Update:
Following advice in the comments, and assuming $\Sigma$ is diagonal, I've concluded that
$$\hat{\theta} = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} y$$
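(My working for this step, in case I've slipped up somewhere: differentiating the matrix-form log-likelihood above with respect to $\theta$ and setting it to zero gives
$$\frac{\partial LL}{\partial \theta} = X^T \Sigma^{-1} (y - X\theta) = 0 \quad\Longrightarrow\quad X^T \Sigma^{-1} X \, \hat{\theta} = X^T \Sigma^{-1} y,$$
which would seem to hold whether or not $\Sigma$ is diagonal, as long as $\Sigma$ is known.)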
and that
$$\hat{\Sigma} = \operatorname{diag}(\hat{\sigma}^2_1, \dots, \hat{\sigma}^2_n)$$
where
$$\hat{\sigma}^2_i = (y_i - x_i^T \hat{\theta})^2$$
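To double-check that last formula: differentiating the $i$-th term of the log-likelihood with respect to $\sigma_i^2$ and setting it to zero gives
$$-\frac{1}{2\sigma_i^2} + \frac{(y_i - x_i^T \theta)^2}{2\sigma_i^4} = 0 \quad\Longrightarrow\quad \sigma_i^2 = (y_i - x_i^T \theta)^2,$$
so at least as a stationary point (at the fitted $\hat{\theta}$) this seems right.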
Assuming these are roughly correct, the MLE of $\theta$ and the MLE of $\Sigma$ seem to depend on each other. So in practice, given some data $y$ and $X$, how would we actually go about using these formulas?
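One concrete way I can imagine using them (essentially feasible / iterative GLS, I think: start from the OLS fit and alternate between the two formulas; the function name, the `eps` guard, and the simulated data below are my own, just for illustration):

```python
import numpy as np

def iterative_gls(X, y, n_iter=20, eps=1e-8):
    """Alternate between the two MLE formulas above:
    theta via weighted least squares given Sigma, and each
    sigma_i^2 as the squared residual given theta."""
    # Step 0: ordinary least squares (all weights equal)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    for _ in range(n_iter):
        # sigma_i^2 = (y_i - x_i^T theta)^2; eps guards against zero weights
        resid = y - X @ theta
        sigma2 = resid ** 2 + eps
        # theta = (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y, Sigma = diag(sigma2)
        w = 1.0 / sigma2                      # diagonal of Sigma^{-1}
        XtWX = X.T @ (w[:, None] * X)
        XtWy = X.T @ (w * y)
        theta = np.linalg.solve(XtWX, XtWy)
    return theta, sigma2

# Hypothetical usage on simulated heteroskedastic data
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.5 + np.abs(X[:, 0])                 # noise scale depends on x
y = X @ theta_true + rng.normal(scale=sigma)
theta_hat, sigma2_hat = iterative_gls(X, y)
print(theta_hat)
```

That said, with one free $\sigma_i^2$ per observation the likelihood can blow up (an observation with a near-zero residual gets an enormous weight), so I suspect this is exactly why estimating the weights is considered the hard part, and why the variances are usually modeled or smoothed rather than estimated one per observation.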
According to these notes, "estimating" the weights, i.e. the error variances, actually seems to be one of the harder steps in practice. So I guess there isn't much point in my working out the MLE of $\Sigma$?