I am reading *Distributed Estimation, Information Loss and Exponential Families*, where the authors consider and compare two estimators of $\theta$ in a parametric model $p(x\mid\theta)$, given samples $x^1,\dots,x^n$:
- the global MLE estimator:
$$ \hat{\theta}^{\mathrm{MLE}}=\mathrm{arg\,max}_{\theta}\sum_{i\in[n]}\log p(x^i\mid\theta),\quad [n]\triangleq\{1,\cdots,n\} $$
- the KL-averaged estimator from the local MLE estimators:
$$ \hat{\theta}^{\mathrm{KL}}=\mathrm{arg\,min}_\theta\sum_k\mathrm{KL}\left(p(x\mid\hat{\theta}^k)\|p(x\mid\theta)\right) $$ where
$$ \hat{\theta}^k=\mathrm{arg\,max}_\theta\sum_{i\in\alpha^k}\log p(x^i\mid\theta),\quad \bigcup_{k=1}^K\alpha^k=[n],\,\alpha^j\cap\alpha^k=\emptyset\mbox{ if }j\ne k $$ Here $K$ is the number of local machines and the sets $\alpha^k$ partition the data indices. The point is to train local models and then combine them via the KL divergence.
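To make the setup concrete, here is a small numerical sketch I put together (my own toy example, not from the paper), assuming a 1-D Gaussian with known unit variance, which is an exponential family with $\phi(x)=x$, and assuming the samples are split into equal-size subsets; in this case the two estimators agree up to floating-point error:

```python
import numpy as np

# Toy sketch (my own, not from the paper): p(x | theta) = N(theta, 1), an
# exponential family with phi(x) = x and natural parameter theta.
# Assumes the n samples are split into K equal-size subsets.
rng = np.random.default_rng(0)
n, K = 1200, 4
x = rng.normal(loc=2.0, scale=1.0, size=n)

# Global MLE: for N(theta, 1) this is the overall sample mean.
theta_mle = x.mean()

# Local MLEs: the sample mean of each subset alpha^k.
local_thetas = [chunk.mean() for chunk in np.split(x, K)]

# KL-averaged estimator: KL(N(m, 1) || N(theta, 1)) = (m - theta)^2 / 2,
# so minimizing the sum over k gives the average of the local MLEs.
theta_kl = float(np.mean(local_thetas))

print(theta_mle, theta_kl)  # agree up to floating-point error
```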
The paper states that for an exponential family $p(x\mid\theta)=\exp\left(\theta^\mathrm{T}\phi(x)-\log Z(\theta)\right)$ we have $\hat{\theta}^{\mathrm{KL}}=\hat{\theta}^{\mathrm{MLE}}$, which can be shown by "directly verifying that the KL objective equals the global negative log-likelihood" (up to terms that do not depend on $\theta$, I assume). But I don't know how to proceed after I have
$$ \begin{align*} &\sum_{k=1}^K\mathrm{KL}\left(p(x\mid\hat{\theta}^k)\,\middle\|\,p(x\mid\theta)\right)\\ =&\sum_{k=1}^K\mathbb{E}_{x\sim p(x\mid\hat{\theta}^k)}\left[\left(\hat{\theta}^k-\theta\right)^{\mathrm{T}}\phi(x)\right]-\sum_{k=1}^K\log Z(\hat{\theta}^k)+K\log Z(\theta) \end{align*} $$
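For what it's worth, I also know that the first-order condition for each local MLE gives moment matching (a standard exponential-family fact, not something I am quoting from the paper):
$$ \nabla_\theta\log Z(\theta)\Big|_{\theta=\hat{\theta}^k}=\mathbb{E}_{x\sim p(x\mid\hat{\theta}^k)}\left[\phi(x)\right]=\frac{1}{|\alpha^k|}\sum_{i\in\alpha^k}\phi(x^i), $$
but I don't see how to combine it with the expression above.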
Any hints?