My supervisor has written some (relatively draft-like) R code that implements an idea from Optimal Experiment Design. However, instead of using the KL divergence
$D_{KL}(P(M | Y = y) || P(M))$
as proposed in the paper (similar to the step on page 3, return KL( mPosterior , mPrior )),
he directly used the absolute value of the difference between the entropies of the two distributions:
$\lvert H(P(M | Y = y)) - H(P(M)) \rvert$
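To make the difference between the two criteria concrete, here is a minimal sketch in R (my own illustration, not the supervisor's code): two distributions that are permutations of each other have identical entropies, so the absolute entropy difference is exactly zero, yet their KL divergence is strictly positive.

```r
# Sketch: KL divergence vs. absolute entropy difference on discrete
# distributions over the model space M. All names here are illustrative.
entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))
kl      <- function(p, q) sum(p[p > 0] * log(p[p > 0] / q[p > 0]))

prior     <- c(0.7, 0.2, 0.1)
posterior <- c(0.1, 0.2, 0.7)  # a permutation of the prior: same entropy

kl(posterior, prior)                      # > 0: the beliefs did change
abs(entropy(posterior) - entropy(prior))  # exactly 0: entropies coincide
```

So the two quantities are not equivalent in general: KL measures how much the belief moved, while the entropy difference only measures how much less (or more) uncertain the belief became.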
I haven't had a chance to ask him about his intention yet (he's busy), so I'd like to post the question here first.
I've already run some experiments based on this code, and it seems to work and does indeed optimize the process (i.e. I now need fewer trials for my models/beliefs to converge, compared to just generating trials from a flat prior).
However, since I haven't seen this formulation before (just the absolute value of the difference between the entropies of two distributions), I'd like to know its theoretical basis: why it works, and why one might choose it instead of KL divergence. Or did I simply misunderstand, and this is actually just a slightly varied form of KL divergence/cross entropy? My stats background is relatively weak.
return KL( mPosterior , mPrior ), i.e. $D_{KL}(P(M | Y = y) || P(M))$, since we're trying to compare the beliefs we have obtained from all the experiment trial results so far with the beliefs updated by just one more experiment trial result, and therefore trying to produce an optimal probability distribution from which the next experiment trial will be sampled. – xji Nov 17 '17 at 14:23
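The one-step-ahead selection described in that comment can be sketched as follows (again my own hedged illustration with made-up toy numbers, not the paper's or the supervisor's code): score each candidate trial design by the expected KL divergence between the updated posterior and the current belief, averaging over the possible outcomes of that trial, and pick the design with the largest expected gain.

```r
# Toy expected-information-gain step for choosing the next trial.
# Two models, two candidate designs; outcomes are binary (success/failure).
kl <- function(p, q) sum(p[p > 0] * log(p[p > 0] / q[p > 0]))

# Probability of "success" under each model (rows) for each design (cols).
# Design 1 discriminates the models; design 2 is uninformative.
lik <- matrix(c(0.9, 0.5,
                0.1, 0.5), nrow = 2, byrow = TRUE)
prior <- c(0.5, 0.5)

eig <- sapply(1:ncol(lik), function(d) {
  sum(sapply(c(1, 0), function(y) {
    like_y <- if (y == 1) lik[, d] else 1 - lik[, d]
    py   <- sum(prior * like_y)   # marginal probability of outcome y
    post <- prior * like_y / py   # Bayes update of the model beliefs
    py * kl(post, prior)          # KL gain weighted by outcome probability
  }))
})
which.max(eig)  # picks design 1: the uninformative design scores 0
```

Under this criterion the uninformative design gets an expected gain of exactly zero, because its posterior always equals the prior; the same need not hold if the score were only an entropy difference.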