This is how I think about it:
$$
D_{KL}(p(y_i | x_i) \:||\: q(y_i | x_i, \theta)) = H(p(y_i | x_i), q(y_i | x_i, \theta)) - H(p(y_i | x_i)) \tag{1}\label{eq:kl}
$$
where $p$ and $q$ are two probability distributions. In machine learning, we typically know $p$, the distribution of the target. For example, in a binary classification problem $\mathcal{Y} = \{0, 1\}$, so if $y_i = 1$ then $p(y_i = 1 | x_i) = 1$ and $p(y_i = 0 | x_i) = 0$, and vice versa.

Given the target $y_i$ for each $i = 1, 2, \ldots, N$, where $N$ is the total number of points in the dataset, we typically want to minimize the KL divergence $D_{KL}(p, q)$ between the distribution of the target $p(y_i | x_i)$ and our predicted distribution $q(y_i | x_i, \theta)$, averaged over all $i$. (We do so by tuning our model parameters $\theta$; for each training example, the model spits out a distribution over the class labels $0$ and $1$.) Since the target for each example is fixed, its distribution never changes: $H(p(y_i | x_i))$ is a constant for each $i$, regardless of the current model parameters $\theta$. Therefore the minimizer of $D_{KL}(p, q)$ is the same as the minimizer of $H(p, q)$.
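To make this concrete, here is a minimal numpy sketch (the helper functions, the `eps` smoothing, and the particular $q$ values are just illustrative choices on my part): with a one-hot target $p$, $H(p) = 0$, so the KL divergence and the cross-entropy are literally the same number, and in general they differ only by the constant $H(p)$.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_k p_k log q_k."""
    return -np.sum(p * np.log(q + eps))

def entropy(p, eps=1e-12):
    """H(p) = -sum_k p_k log p_k."""
    return -np.sum(p * np.log(p + eps))

def kl(p, q):
    """D_KL(p || q) = H(p, q) - H(p), as in equation (1)."""
    return cross_entropy(p, q) - entropy(p)

# Single training example with y_i = 1 in a binary problem:
p = np.array([0.0, 1.0])   # one-hot target: p(y=0|x_i) = 0, p(y=1|x_i) = 1
q = np.array([0.3, 0.7])   # model's predicted distribution q(y|x_i, theta)

print(entropy(p))          # ~0 (exactly 0 up to the eps smoothing) -> H(p) is constant
print(cross_entropy(p, q)) # ~0.3567 = -log 0.7
print(kl(p, q))            # ~0.3567 -> same value, hence the same minimizer
```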
If you had a situation where $p$ and $q$ were both variable (say, two latent variables $x_1 \sim p$ and $x_2 \sim q$) and you wanted to match the two distributions, then you would have to choose between minimizing $D_{KL}(p, q)$ and minimizing $H(p, q)$. This is because, all else being equal, minimizing $D_{KL}(p, q)$ rewards making $H(p)$ large, while minimizing $H(p, q)$ rewards making $H(p)$ small. To see the latter, solve equation (\ref{eq:kl}) for $H(p,q)$:
$$
H(p,q) = D_{KL}(p,q) + H(p) \tag{2}\label{eq:hpq}
$$
The former would yield a broad distribution for $p$, while the latter would yield one that is concentrated in one or a few modes (the sketch below illustrates this on a toy Bernoulli example). Note also that it is your choice as an ML practitioner whether to minimize $D_{KL}(p, q)$ or $D_{KL}(q, p)$; a short discussion of this in the context of variational inference (VI) follows.
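Here is a small numeric sketch of that tension (the fixed $q = \mathrm{Bernoulli}(0.7)$, the grid of Bernoulli parameters for $p$, and the helper function are all illustrative assumptions, not anything from the argument above): minimizing $H(p, q)$ over $p$ drives $p$ toward a point mass, while minimizing $D_{KL}(p, q)$ drives $p$ toward $q$ itself, leaving its entropy alone.

```python
import numpy as np

q = np.array([0.3, 0.7])                 # fixed reference distribution
ts = np.linspace(0.01, 0.99, 99)         # candidate Bernoulli parameters for p
ps = np.stack([1 - ts, ts], axis=-1)     # each row is a candidate p = [p(y=0), p(y=1)]

def H(a, b, eps=1e-12):
    """Row-wise cross-entropy H(a, b); H(a, a) is the entropy of a."""
    return -np.sum(a * np.log(b + eps), axis=-1)

ce = H(ps, q)            # H(p, q) for every candidate p
kl = ce - H(ps, ps)      # D_KL(p || q) = H(p, q) - H(p), equation (1)

print(ts[np.argmin(ce)]) # ~0.99 -> minimizing H(p, q) collapses p toward a point mass
print(ts[np.argmin(kl)]) # ~0.70 -> minimizing D_KL(p, q) matches p to q, keeping H(p)
```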
In VI, you must choose between minimizing $D_{KL}(p, q)$ and $D_{KL}(q, p)$, which are not equal since KL divergence is not symmetric. If we once again treat $p$ as known, then minimizing $D_{KL}(p, q)$ (the "forward" KL) results in a $q$ that is wide and covers a broad range of the domain of $p$, because $q$ is heavily penalized wherever $p$ has mass and $q$ does not. Minimizing $D_{KL}(q, p)$ (the "reverse" KL, the one standard VI actually minimizes) instead results in a $q$ that is sharp and focused on one or a few modes of $p$, because $q$ is heavily penalized wherever it puts mass and $p$ does not. The entropy term in $D_{KL}(q, p) = H(q, p) - H(q)$ does encourage some spread in $q$, but when $p$ is multimodal this mode-seeking behavior dominates.
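The following brute-force sketch illustrates this (the bimodal mixture, the parameter grids, and the discretized KL are all illustrative assumptions on my part, not a standard VI algorithm): fitting a single Gaussian $q$ to a well-separated two-mode $p$, the forward-KL fit comes out wide and straddles both modes, while the reverse-KL fit comes out narrow and locks onto one mode.

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def normal(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# A bimodal "target" p: mixture of two well-separated Gaussians.
p = 0.5 * normal(x, -3.0, 1.0) + 0.5 * normal(x, 3.0, 1.0)

def kl(a, b, eps=1e-300):
    """Discretized D_KL(a || b) = integral of a * log(a / b) dx on the grid."""
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

# Fit a single Gaussian q by brute-force search over (mu, sigma).
best_fwd = best_rev = (np.inf, None, None)
for mu in np.linspace(-5, 5, 101):
    for sigma in np.linspace(0.5, 6.0, 111):
        q = normal(x, mu, sigma)
        fwd, rev = kl(p, q), kl(q, p)          # forward D(p||q), reverse D(q||p)
        best_fwd = min(best_fwd, (fwd, mu, sigma))
        best_rev = min(best_rev, (rev, mu, sigma))

print("forward KL fit (mu, sigma):", best_fwd[1:])  # ~(0, 3.2): wide, covers both modes
print("reverse KL fit (mu, sigma):", best_rev[1:])  # ~(+-3, 1.0): narrow, picks one mode
```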
"'I will die eventually' is almost certain, therefore it has low entropy" – not sure what you meant to write here, but technically speaking an event has no entropy. You can define its information, and you can measure the entropy of the distribution or the system. The statement "I will die eventually" isn't an event either. – Amelio Vazquez-Reina May 30 '20 at 20:23