I am reading *Distributed Estimation, Information Loss and Exponential Families*, where the authors consider and compare two estimators of $\theta$ in a parametric model $p(x\mid\theta)$, given samples $x^1,\dots,x^n$:
- the global MLE estimator:
$$ \hat{\theta}^{\mathrm{MLE}}=\mathrm{arg\,max}_{\theta}\sum_{i\in[n]}\log p(x^i\mid\theta),\quad [n]\triangleq\{1,\cdots,n\} $$
- the KL-averaged estimator from the local MLE estimators:
$$ \hat{\theta}^{\mathrm{KL}}=\mathrm{arg\,min}_\theta\sum_k\mathrm{KL}\left(p(x\mid\hat{\theta}^k)\|p(x\mid\theta)\right) $$ where
$$ \hat{\theta}^k=\mathrm{arg\,max}_\theta\sum_{i\in\alpha^k}\log p(x^i\mid\theta),\quad \bigcup_{k=1}^K\alpha^k=[n],\,\alpha^j\cap\alpha^k=\emptyset\mbox{ if }j\ne k $$ Here $K$ is the number of local machines and the sets $\alpha^k$ partition the data indices. The point is to train local models and then combine them via the KL divergence.
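To make the setup concrete, here is a small numerical sketch I put together (my own toy example, not from the paper), assuming a 1-D Gaussian with known unit variance, which is an exponential family with $\phi(x)=x$, and assuming the samples are split into equal-size subsets; in this case the two estimators agree up to floating-point error:

```python
import numpy as np

# Toy sketch (my own, not from the paper): p(x | theta) = N(theta, 1), an
# exponential family with phi(x) = x and natural parameter theta.
# Assumes the n samples are split into K equal-size subsets.
rng = np.random.default_rng(0)
n, K = 1200, 4
x = rng.normal(loc=2.0, scale=1.0, size=n)

# Global MLE: for N(theta, 1) this is the overall sample mean.
theta_mle = x.mean()

# Local MLEs: the sample mean of each subset alpha^k.
local_thetas = [chunk.mean() for chunk in np.split(x, K)]

# KL-averaged estimator: KL(N(m, 1) || N(theta, 1)) = (m - theta)^2 / 2,
# so minimizing the sum over k gives the average of the local MLEs.
theta_kl = float(np.mean(local_thetas))

print(theta_mle, theta_kl)  # agree up to floating-point error
```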
The paper states that for an exponential family $p(x\mid\theta)=\exp\left(\theta^\mathrm{T}\phi(x)-\log Z(\theta)\right)$ we have $\hat{\theta}^{\mathrm{KL}}=\hat{\theta}^{\mathrm{MLE}}$, which can be shown by "directly verifying that the KL objective equals the global negative log-likelihood" (up to terms that do not depend on $\theta$, I assume). But I don't know how to proceed after I have
$$ \begin{align*} &\sum_{k=1}^K\mathrm{KL}\left(p(x\mid\hat{\theta}^k)\,\middle\|\,p(x\mid\theta)\right)\\ =&\sum_{k=1}^K\mathbb{E}_{x\sim p(x\mid\hat{\theta}^k)}\left[\left(\hat{\theta}^k-\theta\right)^{\mathrm{T}}\phi(x)\right]-\sum_{k=1}^K\log Z(\hat{\theta}^k)+K\log Z(\theta) \end{align*} $$
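For what it's worth, I also know that the first-order condition for each local MLE gives moment matching (a standard exponential-family fact, not something I am quoting from the paper):
$$ \nabla_\theta\log Z(\theta)\Big|_{\theta=\hat{\theta}^k}=\mathbb{E}_{x\sim p(x\mid\hat{\theta}^k)}\left[\phi(x)\right]=\frac{1}{|\alpha^k|}\sum_{i\in\alpha^k}\phi(x^i), $$
but I don't see how to combine it with the expression above.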
Any hints?