
Most papers define learning as maximizing the expectation of some function, i.e. finding parameters $\theta$ that maximize something of the form $\mathbb{E}_{x \sim \mathcal{D}}\left[ f(x; \theta) \right]$.


  • Is expectation maximization applicable to most deep learning models in the literature?

  • Can learning be generalized to the EM algorithm?

  • Most of the models can be treated as probability functions, but when is this not the case?

  • Can we express every deep learning model as part of the EM training algorithm?

iordanis
  • The EM algorithm is generally used when there is missing information: either an observed variable with some values not available, or a model with what's called a latent (unobservable) variable. Otherwise, direct estimation methods are faster. – Cagdas Ozgenc Feb 04 '20 at 20:25
  • Right, but for a DNN we train in batches; wouldn't that make it equivalent? In all the papers I read, DNN learning is defined as an EM step, e.g. https://arxiv.org/pdf/1706.06083.pdf section 2. – iordanis Feb 04 '20 at 20:29

4 Answers


(The original version of this post, the text of which I kept below the line for reference purposes, generated a lot of dispute and some back and forth which seems mostly to be around questions of interpretation and ambiguity, so I updated with a more direct answer)

The OP seems to be asking:

Are Deep Learning models a special case of the EM algorithm?

No, they aren't. Deep Learning models are general-purpose function approximators that can use different types of objective functions and training algorithms, whereas the EM algorithm is a very specific algorithm in terms of both its training approach and its objective function.

From this perspective, it is possible (although not very common) to use Deep Learning to emulate the EM algorithm. See this paper.

Most of the (deep learning) models can be treated as probability functions, but when is this not the case?

Probability distribution functions have to satisfy certain conditions such as summing up to one (the conditions are slightly different if you consider probability density functions). Deep Learning models can approximate functions in general - i.e. a larger class of functions than those that correspond to probability distributions and densities.

When do they not correspond to probability densities and distributions? Any time the function they approximate doesn't satisfy the axioms of probability theory. For example, a network whose output layer has $\tanh$ activations can take negative values, and therefore doesn't satisfy the conditions for being a probability distribution or density.
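
To make the distinction concrete, here is a minimal numpy sketch (the logits are made up purely for illustration) contrasting a $\tanh$ output layer, whose values can be negative and need not sum to one, with a softmax output layer, which always yields a valid discrete distribution:

```python
import numpy as np

# Hypothetical raw outputs (logits) of a network's final layer for one input.
logits = np.array([1.5, -0.3, 0.8])

# tanh activations: values lie in (-1, 1), can be negative and need not sum to 1,
# so they cannot be read directly as a probability distribution over classes.
tanh_out = np.tanh(logits)
print(tanh_out, tanh_out.sum())        # e.g. [ 0.905 -0.291  0.664], sum is not 1

# softmax activations: non-negative and summing to 1, so they do satisfy the
# conditions for a discrete probability distribution.
softmax_out = np.exp(logits) / np.exp(logits).sum()
print(softmax_out, softmax_out.sum())  # sums to 1.0
```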

There are three ways that a deep learning model can correspond to a probability distribution $P(x)$:

  • Use Deep Learning to learn a probability distribution directly. That is, have your neural network learn the shape of $y=P(x)$. Here $P(x)$ satisfies the conditions for being a probability density or distribution.

  • Have a Deep Learning model learn a general function $y=f(x)$ (that doesn't satisfy the conditions for being a probability distribution). After training the model, we then make assumptions about the probability distribution $P(y|x)$, e.g. that the errors are normally distributed, and then use simulations to sample from that distribution (see the sketch after this list). See here for an example of how that can be done.

  • Have a Deep Learning model learn a general function $y=f(x)$ (that doesn't satisfy the conditions for being a probability distribution), and then interpret the output of the model as representing $P(y|x) = \delta[y-f(x)]$, with $\delta$ being the Dirac delta function. There are two issues with this last approach. First, there is some debate as to whether the Dirac delta constitutes a valid distribution function; it is popular in the signal processing and physics communities, but not so much among the probability and statistics crowd. Second, it doesn't provide any useful information from a probability and statistics point of view, since it doesn't give any way of quantifying the uncertainty of the output, which defeats the purpose of using a probabilistic model in practice.
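
As a minimal sketch of the second approach (using an ordinary least-squares fit as a stand-in for a trained deep learning model, with made-up data and an assumed Gaussian error model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training data (stand-in for whatever the network would be trained on).
x = rng.uniform(-1, 1, size=200)
y = 2 * x + 1 + rng.normal(scale=0.3, size=200)

# Step 1: learn a point predictor y = f(x). A linear fit stands in for the network here.
w, b = np.polyfit(x, y, deg=1)

def f(x_in):
    return w * x_in + b

# Step 2: after training, assume a form for P(y|x), e.g. normally distributed errors,
# and estimate the error spread from the residuals.
sigma = np.std(y - f(x))

# Step 3: treat P(y|x) as Normal(f(x), sigma^2) and sample from it.
x_new = 0.5
samples = rng.normal(loc=f(x_new), scale=sigma, size=1000)
print(f(x_new), samples.mean(), samples.std())  # point prediction vs. sampled distribution
```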


Is expectation maximization applicable to most deep learning models in the literature?

Not really. There are several key differences:

  • Deep Learning models work by minimizing a loss function. Different loss functions are used for different problems, and the training algorithm then focuses on the best way to minimize the particular loss function suitable for the problem at hand. The EM algorithm, on the other hand, is about maximizing a likelihood function. The issue here isn't simply that we are maximizing instead of minimizing (both are optimization problems after all), but the fact that EM dictates a specific function to be optimized, whereas Deep Learning can use any loss function as long as it is compatible with the training method (which is usually some variant of Gradient Descent).
  • EM estimates the parameters of a statistical model by maximizing the likelihood of those parameters. So we choose the model beforehand (e.g. a Gaussian with mean $\mu$ and variance $\sigma^2$), and then we use EM to find the best values of those parameters (e.g. which values of $\mu$ and $\sigma^2$ best fit our data) - see the sketch after this list. Deep Learning models are non-parametric: they don't make any assumptions about the shape or distribution of the data. Instead they are universal approximators which, given enough neurons and layers, should be able to fit any function.
  • Closely related to the previous point is the fact that Deep Learning models are just function approximators that can approximate arbitrary functions without having to respect any of the constraints imposed on a probability distribution function. An MLE model, or even a non-parametric distribution estimator for that matter, will be bound by the laws of probability and the constraints imposed on probability distributions and densities.
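
To make the contrast concrete, here is a minimal sketch of EM itself for a hand-picked parametric model - a two-component 1-D Gaussian mixture fit to made-up data - where both the model form and the objective (the likelihood) are fixed in advance, unlike the free choice of architecture and loss function in Deep Learning:

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up data from two Gaussians; the component labels are the "missing" information.
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 200)])

# Initial guesses for the parametric model: mixing weight, means, variances.
pi, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def normal_pdf(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: posterior responsibility of each component for each data point.
    p0 = pi * normal_pdf(data, mu[0], var[0])
    p1 = (1 - pi) * normal_pdf(data, mu[1], var[1])
    r0 = p0 / (p0 + p1)
    r1 = 1 - r0
    # M-step: re-estimate the parameters to maximize the expected log-likelihood.
    pi = r0.mean()
    mu = np.array([(r0 * data).sum() / r0.sum(), (r1 * data).sum() / r1.sum()])
    var = np.array([(r0 * (data - mu[0]) ** 2).sum() / r0.sum(),
                    (r1 * (data - mu[1]) ** 2).sum() / r1.sum()])

print(pi, mu, var)  # should recover roughly 0.6, [-2, 3], [1, 2.25]
```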

Now certain types of deep learning models can be considered equivalent to an MLE model, but what is really happening under the hood is that we are specifically asking the neural network to learn a probability distribution, as opposed to a more general arbitrary function, by choosing certain activation functions and adding some constraints on the outputs of the network. All that means is that they are acting as MLE estimators, not that they are special cases of the EM algorithm.
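
For instance, a classifier with a softmax output trained with cross-entropy loss is, under the hood, doing maximum likelihood estimation of a categorical distribution. A minimal numpy sketch (with made-up logits) makes the equivalence explicit:

```python
import numpy as np

# Hypothetical logits produced by a network for one example with 3 classes.
logits = np.array([2.0, -1.0, 0.5])
true_class = 0

# Softmax turns the logits into a categorical distribution over the classes.
probs = np.exp(logits) / np.exp(logits).sum()

# The usual cross-entropy loss for this example ...
one_hot = np.zeros(3)
one_hot[true_class] = 1.0
cross_entropy = -(one_hot * np.log(probs)).sum()

# ... is exactly the negative log-likelihood of the observed class under that
# categorical distribution, i.e. the MLE objective.
neg_log_likelihood = -np.log(probs[true_class])

assert np.isclose(cross_entropy, neg_log_likelihood)
```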

Is the learning considered to be part of the EM algorithm?

I would say that it is the other way around. It is possible that someone, somewhere, has come up with a Deep Learning model that is equivalent to the EM algorithm, but that would make the EM algorithm a special case of Deep Learning, not the other way around, since for this to work, you would have to use Deep Learning + additional constraints to make the model mimic EM.


In response to the comments:

  1. "Minimizing and maximizing can be the same thing.": Agreed, they can be (almost) equivalent - which what I specified in my response - it is NOT about maximizing vs. minimizing, it is about having to use a specific objective function dictated by MLE, vs. being able to use just about any loss function compatible with backpropagation. "The loss function in this case is the expectation of E p_theta(x|z) where p_theta is the deep neural network model." - again this is possible, but as I point out later, this would make MLE a special case of Deep Learning, not the other way around.
  2. "Parameters in the case of the deep neural networks are the model weights. I don't think your explanation is correct" - the explanation is correct, but you are also correct that the word parametric is ambiguous, and is used in different ways in different sources. Model weights are parameters in the general sense, but in the strict sense of parametric vs. non-parametric models, they aren't the same as the parameters in a parametric model. In parametric model, the parameters have a meaning, they correspond to the mean, the variance, the seasonality in a time series, etc...whereas the parameters in a Deep Learning model don't having any meaning, they are jus the most convenient way for the network to store information. That is why neural networks are criticized for being black box - there is no established way of interpreting the meaning of the weights. Another way you can think of it is in terms of total parameters vs. number of effective parameters: In a truly parametric model that is estimated using EM, the number of fixed parameters is the same as the number of effective parameters. In a neural network, the number of effective parameters may change during training (by reducing weights to zero, or by using drop out, etc....), even if the total number of parameters is defined before hand and is fixed. Which brings us to the real difference between the two approaches: A fixed number of effective parameters means that the shape of the distribution or function is decided before hand, whereas changing effective parameters allows for models to approximate more general, and eventually arbitrary functions, per the universal approximation theorem.
  3. "DNN also try to learn the probability distribution of the data in order to make predictions." only if we configure and constrain them to learn probability distributions. But they can also learn other things besides probability distributions. To this how this is possible, you can simply specify a multi-class neural network, with 4 outputs, with sigmoid outputs instead of softmax outputs, and train it to learn cases where the output is [1, 1, 1, 1]. Since the sum of the outputs is > 1, this is not a probability distribution, but just an arbitrary mapping of the inputs to classes. More generally Neural Networks/Deep Learning models are just general purpose function approximators, which can be configured to the specific case of estimation probability distribution functions, but they are not limited to that case. In computer vision for example, the are often used as filters and segmentation devises, instead of as classifiers or distribution estimators.

As Cagdas Ozgenc points out, just about any supervised learning problem or function approximation problem can be recast as an MLE.
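
For example, least-squares regression can be read as maximum likelihood under an assumed Gaussian error model: if we posit $y = f_\theta(x) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, then

$$\log P(y_1,\dots,y_n \mid x_1,\dots,x_n;\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - f_\theta(x_i)\bigr)^2 - \frac{n}{2}\log(2\pi\sigma^2),$$

so maximizing the likelihood over $\theta$ is the same as minimizing the usual squared-error loss.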

Skander H.
  • Minimizing and maximizing can be the same thing, e.g. if you are minimizing the negative of a function then you are maximizing that function, so I don't agree with you there. The loss function in this case is the expectation of $p_\theta(x|z)$ where $p_\theta$ is the deep neural network model. – iordanis Feb 04 '20 at 19:37
  • Parameters in the case of the deep neural networks are the model weights. I don't think your explanation is correct. – iordanis Feb 04 '20 at 19:38
  • DNN also try to learn the probability distribution of the data in order to make predictions. – iordanis Feb 04 '20 at 19:39