
Most papers define learning as maximizing the expectation of some function, i.e. finding parameters $\theta$ that maximize something of the form $\mathbb{E}_{x \sim \mathcal{D}}\left[ f(x; \theta) \right]$.


  • Is expectation maximization applicable to most deep learning models in the literature?

  • Can learning be generalized to the EM algorithm?

  • Most of the models can be treated as probability functions, but when is this not the case?

  • Can we express every deep learning model as part of the EM training algorithm?

iordanis
  • The EM algorithm is generally used when there is missing information: either an observed variable with some values not available, or a model with what's called a latent (unobservable) variable. Otherwise, direct estimation methods are faster. – Cagdas Ozgenc Feb 04 '20 at 20:25
  • Right, but for a DNN we train in batches; wouldn't that make it equivalent? In all the papers I read, DNN learning is defined as an EM step, e.g. https://arxiv.org/pdf/1706.06083.pdf section 2. – iordanis Feb 04 '20 at 20:29

4 Answers


(The original version of this post, the text of which I kept below the line for reference purposes, generated a lot of dispute and some back and forth which seems mostly to be around questions of interpretation and ambiguity, so I updated with a more direct answer)

The OP seems to be asking:

Are Deep Learning models a special case of the EM algorithm?

No, they aren't. Deep Learning models are general-purpose function approximators that can use different types of objective functions and training algorithms, whereas the EM algorithm is a very specific algorithm in terms of both its training approach and its objective function.

From this perspective, it is possible (although not very common) to use Deep Learning to emulate the EM algorithm. See this paper.

Most of the (deep learning) models can be treated as probability functions, but when is this not the case?

Probability distribution functions have to satisfy certain conditions such as summing up to one (the conditions are slightly different if you consider probability density functions). Deep Learning models can approximate functions in general - i.e. a larger class of functions than those that correspond to probability distributions and densities.

When do they not correspond to probability densities and distributions? Any time the function they approximate doesn't satisfy the axioms of probability theory. For example, a network whose output layer has $\tanh$ activations can take negative values, and therefore doesn't satisfy the conditions for being a probability distribution or density.
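
To make the distinction concrete, here is a minimal numpy sketch (the logits are made up purely for illustration) contrasting a $\tanh$ output layer, whose values can be negative and need not sum to one, with a softmax output layer, which always yields a valid discrete distribution:

```python
import numpy as np

# Hypothetical raw outputs (logits) of a network's final layer for one input.
logits = np.array([1.5, -0.3, 0.8])

# tanh activations: values lie in (-1, 1), can be negative and need not sum to 1,
# so they cannot be read directly as a probability distribution over classes.
tanh_out = np.tanh(logits)
print(tanh_out, tanh_out.sum())        # e.g. [ 0.905 -0.291  0.664], sum is not 1

# softmax activations: non-negative and summing to 1, so they do satisfy the
# conditions for a discrete probability distribution.
softmax_out = np.exp(logits) / np.exp(logits).sum()
print(softmax_out, softmax_out.sum())  # sums to 1.0
```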

There are three ways that a deep learning model can correspond to a probability distribution $P(x)$:

  • Use Deep Learning to learn a probability distribution directly. That is, have your neural network learn the shape of $y=P(x)$. Here $P(x)$ satisfies the conditions for being a probability density or distribution.

  • Have a Deep Learning model learn a general function $y=f(x)$ (that doesn't satisfy the conditions for being a probability distribution). After training the model, we then make assumptions about the probability distribution $P(y|x)$, e.g. that the errors are normally distributed, and then use simulations to sample from that distribution (see the sketch after this list). See here for an example of how that can be done.

  • Have a Deep Learning model learn a general function $y=f(x)$ (that doesn't satisfy the conditions for being a probability distribution), and then interpret the output of the model as representing $P(y|x) = \delta[y-f(x)]$, with $\delta$ being the Dirac delta function. There are two issues with this last approach. First, there is some debate as to whether the Dirac delta constitutes a valid distribution function; it is popular in the signal processing and physics communities, but not so much among the probability and statistics crowd. Second, it doesn't provide any useful information from a probability and statistics point of view, since it doesn't give any way of quantifying the uncertainty of the output, which defeats the purpose of using a probabilistic model in practice.
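
As a minimal sketch of the second approach (using an ordinary least-squares fit as a stand-in for a trained deep learning model, with made-up data and an assumed Gaussian error model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training data (stand-in for whatever the network would be trained on).
x = rng.uniform(-1, 1, size=200)
y = 2 * x + 1 + rng.normal(scale=0.3, size=200)

# Step 1: learn a point predictor y = f(x). A linear fit stands in for the network here.
w, b = np.polyfit(x, y, deg=1)

def f(x_in):
    return w * x_in + b

# Step 2: after training, assume a form for P(y|x), e.g. normally distributed errors,
# and estimate the error spread from the residuals.
sigma = np.std(y - f(x))

# Step 3: treat P(y|x) as Normal(f(x), sigma^2) and sample from it.
x_new = 0.5
samples = rng.normal(loc=f(x_new), scale=sigma, size=1000)
print(f(x_new), samples.mean(), samples.std())  # point prediction vs. sampled distribution
```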


Is expectation maximization applicable to most deep learning models in the literature?

Not really. There are several key differences:

  • Deep Learning models work by minimizing a loss function. Different loss functions are used for different problems, and the training algorithm then focuses on the best way to minimize the particular loss function suitable for the problem at hand. The EM algorithm, on the other hand, is about maximizing a likelihood function. The issue here isn't simply that we are maximizing instead of minimizing (both are optimization problems after all), but the fact that EM dictates a specific function to be optimized, whereas Deep Learning can use any loss function as long as it is compatible with the training method (which is usually some variant of Gradient Descent).
  • EM estimates the parameters of a statistical model by maximizing the likelihood of those parameters. So we choose the model beforehand (e.g. a Gaussian with mean $\mu$ and variance $\sigma^2$), and then we use EM to find the best values of those parameters (e.g. which values of $\mu$ and $\sigma^2$ best fit our data) - see the sketch after this list. Deep Learning models are non-parametric: they don't make any assumptions about the shape or distribution of the data. Instead they are universal approximators which, given enough neurons and layers, should be able to fit any function.
  • Closely related to the previous point is the fact that Deep Learning models are just function approximators that can approximate arbitrary functions without having to respect any of the constraints imposed on a probability distribution function. An MLE model, or even a non-parametric distribution estimator for that matter, will be bound by the laws of probability and the constraints imposed on probability distributions and densities.
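
To make the contrast concrete, here is a minimal sketch of EM itself for a hand-picked parametric model - a two-component 1-D Gaussian mixture fit to made-up data - where both the model form and the objective (the likelihood) are fixed in advance, unlike the free choice of architecture and loss function in Deep Learning:

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up data from two Gaussians; the component labels are the "missing" information.
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 200)])

# Initial guesses for the parametric model: mixing weight, means, variances.
pi, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def normal_pdf(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: posterior responsibility of each component for each data point.
    p0 = pi * normal_pdf(data, mu[0], var[0])
    p1 = (1 - pi) * normal_pdf(data, mu[1], var[1])
    r0 = p0 / (p0 + p1)
    r1 = 1 - r0
    # M-step: re-estimate the parameters to maximize the expected log-likelihood.
    pi = r0.mean()
    mu = np.array([(r0 * data).sum() / r0.sum(), (r1 * data).sum() / r1.sum()])
    var = np.array([(r0 * (data - mu[0]) ** 2).sum() / r0.sum(),
                    (r1 * (data - mu[1]) ** 2).sum() / r1.sum()])

print(pi, mu, var)  # should recover roughly 0.6, [-2, 3], [1, 2.25]
```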

Now certain types of deep learning models can be considered equivalent to an MLE model, but what is really happening under the hood is that we are specifically asking the neural network to learn a probability distribution, as opposed to a more general arbitrary function, by choosing certain activation functions and adding some constraints on the outputs of the network. All that means is that they are acting as MLE estimators, not that they are special cases of the EM algorithm.
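
For instance, a classifier with a softmax output trained with cross-entropy loss is, under the hood, doing maximum likelihood estimation of a categorical distribution. A minimal numpy sketch (with made-up logits) makes the equivalence explicit:

```python
import numpy as np

# Hypothetical logits produced by a network for one example with 3 classes.
logits = np.array([2.0, -1.0, 0.5])
true_class = 0

# Softmax turns the logits into a categorical distribution over the classes.
probs = np.exp(logits) / np.exp(logits).sum()

# The usual cross-entropy loss for this example ...
one_hot = np.zeros(3)
one_hot[true_class] = 1.0
cross_entropy = -(one_hot * np.log(probs)).sum()

# ... is exactly the negative log-likelihood of the observed class under that
# categorical distribution, i.e. the MLE objective.
neg_log_likelihood = -np.log(probs[true_class])

assert np.isclose(cross_entropy, neg_log_likelihood)
```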

Is the learning considered to be part of the EM algorithm?

I would say that it is the other way around. It is possible that someone, somewhere, has come up with a Deep Learning model that is equivalent to the EM algorithm, but that would make the EM algorithm a special case of Deep Learning, not the other way around, since for this to work, you would have to use Deep Learning + additional constraints to make the model mimic EM.


In response to the comments:

  1. "Minimizing and maximizing can be the same thing.": Agreed, they can be (almost) equivalent - which what I specified in my response - it is NOT about maximizing vs. minimizing, it is about having to use a specific objective function dictated by MLE, vs. being able to use just about any loss function compatible with backpropagation. "The loss function in this case is the expectation of E p_theta(x|z) where p_theta is the deep neural network model." - again this is possible, but as I point out later, this would make MLE a special case of Deep Learning, not the other way around.
  2. "Parameters in the case of the deep neural networks are the model weights. I don't think your explanation is correct" - the explanation is correct, but you are also correct that the word parametric is ambiguous, and is used in different ways in different sources. Model weights are parameters in the general sense, but in the strict sense of parametric vs. non-parametric models, they aren't the same as the parameters in a parametric model. In parametric model, the parameters have a meaning, they correspond to the mean, the variance, the seasonality in a time series, etc...whereas the parameters in a Deep Learning model don't having any meaning, they are jus the most convenient way for the network to store information. That is why neural networks are criticized for being black box - there is no established way of interpreting the meaning of the weights. Another way you can think of it is in terms of total parameters vs. number of effective parameters: In a truly parametric model that is estimated using EM, the number of fixed parameters is the same as the number of effective parameters. In a neural network, the number of effective parameters may change during training (by reducing weights to zero, or by using drop out, etc....), even if the total number of parameters is defined before hand and is fixed. Which brings us to the real difference between the two approaches: A fixed number of effective parameters means that the shape of the distribution or function is decided before hand, whereas changing effective parameters allows for models to approximate more general, and eventually arbitrary functions, per the universal approximation theorem.
  3. "DNN also try to learn the probability distribution of the data in order to make predictions." only if we configure and constrain them to learn probability distributions. But they can also learn other things besides probability distributions. To this how this is possible, you can simply specify a multi-class neural network, with 4 outputs, with sigmoid outputs instead of softmax outputs, and train it to learn cases where the output is [1, 1, 1, 1]. Since the sum of the outputs is > 1, this is not a probability distribution, but just an arbitrary mapping of the inputs to classes. More generally Neural Networks/Deep Learning models are just general purpose function approximators, which can be configured to the specific case of estimation probability distribution functions, but they are not limited to that case. In computer vision for example, the are often used as filters and segmentation devises, instead of as classifiers or distribution estimators.

As Cagdas Ozgenc points out, just about any supervised learning problem or function approximation problem can be recast as an MLE.
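
For example, least-squares regression can be read as maximum likelihood under an assumed Gaussian error model: if we posit $y = f_\theta(x) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, then

$$\log P(y_1,\dots,y_n \mid x_1,\dots,x_n;\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - f_\theta(x_i)\bigr)^2 - \frac{n}{2}\log(2\pi\sigma^2),$$

so maximizing the likelihood over $\theta$ is the same as minimizing the usual squared-error loss.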

Skander H.
  • Minimizing and maximizing can be the same thing, e.g. if you are minimizing the negative of a function then you are maximizing that function, so I don't agree with you there. The loss function in this case is the expectation of $p_\theta(x|z)$ where $p_\theta$ is the deep neural network model. – iordanis Feb 04 '20 at 19:37
  • Parameters in the case of the deep neural networks are the model weights. I don't think your explanation is correct. – iordanis Feb 04 '20 at 19:38
  • DNN also try to learn the probability distribution of the data in order to make predictions. – iordanis Feb 04 '20 at 19:39