4

In supervised machine learning, we are given a set of data points $\{x_i\}_{i=1}^N$, where each data point $x_i$ is associated with a label $y_i$ from a set $\{y_i\}_{i=1}^N$.

We would like to create/train functions $f$ to do several things.

  1. create a best-fit curve: $f: x_i \mapsto y_i$ such that a metric (the training error) is minimized.

  2. regression: assign a real value to a new data point

  3. classification: assign a discrete label to a new data point

These are the most common tasks that supervised machine learning does. The deep neural network is the most popular general functional form $f$ for these tasks; it encapsulates linear, logistic, multi-class logistic, and probit regression, among others.

Observe one key thing in my description: I did not mention anything about the existence of a "statistical model" that so many machine learning authors discuss in their textbooks.


By statistical model, I mean some prescribed relationship between the data and the labels, as described in the first sentence.

For example: according to https://statmath.wu.ac.at/courses/heather_turner/glmCourse_001.pdf

A linear model is: $y_i = \beta_0 + \beta_1^T x_i + \epsilon_i$ where $\epsilon_i$ is a noise term.

A general linear model is: $y_i = \beta_0 + \beta_1^T x_{1i} + \ldots + \beta_p^T x_{pi} + \epsilon_i$

A logistic regression model is: $\log(y_i) = \beta_0 + \beta_1^T x_{i} + \epsilon_i$

There are also non-parametric and semi-parametric models, all of the form $y_i = m(x_i) + \epsilon_i$, where $m$ is some function.
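(To be concrete about what I mean by a "prescribed relationship": here is a minimal sketch of my own, with made-up coefficients and an assumed Gaussian noise term, of the first form above as a data-generating process and a least-squares fit of it.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Prescribed relationship: y_i = b0 + b1 * x_i + eps_i, with eps_i ~ N(0, sigma^2)
n = 200
x = rng.uniform(-2, 2, size=n)
eps = rng.normal(0, 0.5, size=n)          # the assumed noise term
y = 1.0 + 3.0 * x + eps                   # data generated according to the model

# Fitting the model = estimating (b0, b1), here by ordinary least squares
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                           # roughly [1.0, 3.0]
```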


I am confused about why this discussion of statistical models is needed, especially when the authors then use these statistical models to do essentially the same thing as machine learning models.

Here are some related concerns:

  • Intuitively, in the real world we do not know what the relationship between label and data is. So why bother creating a relationship between them?

  • In no part of modern machine learning training/validation/testing routine do we require the assumption of a clear relationship between $x_i$ and $y_i$. It literally does not provide us with any additional information.

  • In modern deep supervised machine learning, the concept of a model is not even remotely mentioned. For example, it doesn't seem to make sense to say things such as "what's the statistical model of GoogLeNet", or "what's the statistical model of LSTM". I mean, what is the functional form of $m$ and noise assumption on $\epsilon_i$, $y_i = m(x_i) + \epsilon_i$ for things like DenseNet or U-Net?

Can someone please help me understand this?

  • 5
    All of the examples you give of “statistical models” are linear when $x$ is fixed before the coefficients are estimated. Is this intentional? Because there are many other types of models that are not linear; decision trees optimize for step functions that are good at predicting $y$, for instance. If we open up “statistical model” to any one of a large family of functions $m(x)$, then $m$ includes U-Net and LSTMs, and it’s hard to understand what the question is, because you stated at the outset that you want to predict $y$. I don’t see how you can do that without some function $m$. – Sycorax Jul 12 '23 at 00:08
  • 4
    How do you propose to predict regression values or classification labels without a model? // What if you called it a “machine learning” model instead of a “statistical” model? – Dave Jul 12 '23 at 00:18
  • @Dave I am talking about the relationship between $x_i$ and $y_i$. Why do we need to assume the existence of such a relationship and in that particular form? (See: https://en.wikipedia.org/wiki/Statistical_model under nested model). It says clearly, that the linear model is $y = b_0 + b_1 x + \epsilon$, $\epsilon$ is Gaussian. We do not need to assume this linear model (again: $y = ... + \epsilon$) to do prediction. – Curaçao Hajek Jul 12 '23 at 03:31
  • @Sycorax-OnStrike I'm sorry, but your argument is not correct. Please consider: suppose $m$ is U-Net; then what is the relationship between $y$ and $x$? Can we write $y_i = m(x_i) + \epsilon_i$ for some noise $\epsilon$? No! Because this assumes that all the error between $y_i$ and the output of U-Net $m(x_i)$ is in that $\epsilon$ noise. But this noise does not need to be something that's added to the output $m(x_i)$ (it could be multiplicative, or mapped through some nonlinear function). This noise certainly does not need to come from some standard distribution. – Curaçao Hajek Jul 12 '23 at 03:33
  • 1
    I certainly didn't mean to suggest that one is limited to solely additive error models, but this seems incidental to your question. Naturally, there are alternative models. For instance, the error in $\log y = \beta^T x + \epsilon$ is not additive; on the original scale $y$, the error is multiplicative. Are you asking for examples of models that do not have additive errors? ([tag:Logistic] regression is an example.) It seems that you know such models exist, so I don't know what else an answer could provide. – Sycorax Jul 12 '23 at 03:47
  • 2
    An unstated portion of your question is what loss is minimized. My comment on your previous question pointed out the connection between MSE as a loss function and the Gaussian log-likelihood. Writing down $y = \beta^T x + \epsilon$ doesn't, on its own, tell us what the optimal coefficients $\beta$ are -- we need a loss function to do that. Minimizing average error, median errors, or other quantities gives rise to alternative optimal $\beta$s. So this is one place where "statistical models" enter machine learning. Does this answer your question? – Sycorax Jul 12 '23 at 04:08
  • @Sycorax-OnStrike Thanks for your reply: to your second-to-last comment. Please provide me with the explicit form of the statistical model of "Highway Network" (https://arxiv.org/pdf/1505.00387.pdf). Note that the linear regression model $y = b_0 + b_1 x + \epsilon$ is a subset of this type of network with appropriate network connections and activation functions. – Curaçao Hajek Jul 14 '23 at 20:33
  • @Sycorax-OnStrike You seem to insist that there is some connection between the loss function and a certain distribution, e.g., the Gaussian log-likelihood. This is not true in general and certainly does not assist us in modern machine learning tasks; in ML, we do not know any distribution. Recent research has found that the loss function is a function of the geometry and does not require linking with any likelihood or probability distribution. Please see table 1 of https://jmlr.csail.mit.edu/papers/volume21/19-021/19-021.pdf (no Gaussian or any other log-likelihood can be found). – Curaçao Hajek Jul 14 '23 at 20:36
  • 5
    You've entirely lost the plot. The models in the highway network paper are trained using categorical cross-entropy loss. Categorical cross-entropy loss is an example of a cross-entropy loss function, so it has a direct and obvious relationship to a specific probability distribution. See: https://stats.stackexchange.com/questions/378274/how-to-construct-a-cross-entropy-loss-for-general-regression-targets The fact that the paper uses a highway network is entirely incidental to the present discussion -- it's just a particularly sophisticated type of basis expansion of the input. – Sycorax Jul 14 '23 at 21:12
  • 2
    So we can write the highway model as $p_i = m(x_i)$ s.t. $\sum_j p_{ij}=1$ and $p_{ij} > 0 \;\forall j$, whence $y_i \sim \text{Categorical}(p_i)$, where $m$ is the highway network. The most charitable reading of your comments is that you've retreated from "statistical models are irrelevant to machine learning" to "not all machine learning models can be written as optimizations of probability models," which seems like a matter you're perfectly capable of investigating on your own. The main trouble is that the examples you chose in your comments and question are probability models. – Sycorax Jul 14 '23 at 21:14
  • @Sycorax-OnStrike I have not lost the plot. I am asserting that an analytical expression between the data input $x$ and the true label $y$ doesn't even exist in general, hence (back to the title of my question), it is irrelevant. Can you not see that the picture of a cat and an integer representing the class of a cat (out of other classes), has no relationship in general? – Curaçao Hajek Jul 15 '23 at 00:30
  • @Sycorax-OnStrike Your very last comment shows your misunderstanding. You have pre-emptively assumed that the network cannot be trained by any loss other than categorical cross-entropy. You can train any neural network with mean-squared loss, period. The "optimal/right" error function/loss function is a matter of geometry and does not depend on the notion of a statistical model. – Curaçao Hajek Jul 15 '23 at 00:31
  • 1
    @CuraçaoHajek If you train by minimizing square loss, then you’re just using some other extremum estimator. – Dave Jul 15 '23 at 00:42
  • 5
    You asked me about the paper, and I wrote a comment about the paper. Using MSE to train a neural network is a different model, not described in the paper. I find it hard to believe that a person using a computer has a principled objection to using numbers to represent images or words. Your objection actually reduces to an argument against indexing, which might be “unnatural” by your invented criteria, but appears everywhere in math. Quite frankly, this has become tedious and your arguments are simply specious. I will absent myself at this time. – Sycorax Jul 15 '23 at 01:59
  • 5
    "In modern deep supervised machine learning, the concept of a model is not even remotely mentioned." unfortunately a lot of people working with deep learning methods are unaware of previous work on neural networks, or of statistics. Just because these things are not mentioned, doesn't mean they aren't there (and deep learning would be better if it did remember it's origins rather better). This is nothing new, the early days of neural networks were largely ignorant of the statistical aspects as well. – Dikran Marsupial Jul 16 '23 at 15:38
  • 3
    "You can train any neural network with mean-squared loss, period." indeed, but the optimal choice of loss function is part of statistical modelling. The form of the model (especially when comparing universal approximators, such as NN), is essentially irrellevant and the concerns are usually computational rather than statistical. – Dikran Marsupial Jul 16 '23 at 15:41
  • 4
    BTW some statistical models (such as Gaussian Processes) do not have a predetermined mathematical form - indeed there is no single model as such - the predictions are based on the calibration data and the hyper-parameter values, and the model itself is marginalised (integrated over). You appear not to understand what is meant by a statistical model; the problem is that the question appears to be based on a misunderstanding. – Dikran Marsupial Jul 16 '23 at 15:47

4 Answers

10

You seem to be misunderstanding what we mean by "models".

Imagine that you are observing a horse race. At one point you say, "Horse number 3 is gonna win!". How did you draw that conclusion? You did not have perfect knowledge about each muscle in each horse taking part in the race, about their breaths and heartbeats, each sand grain on the race track, each thought and feeling of each horse and jockey, each sound and gust of wind that could affect the result, etc. Even if you had access to such knowledge, the information would not fit in the working memory of your brain. Moreover, even if you had the information and it fit your working memory, we do not have a perfect enough understanding of the universe (physics, psychology, physiology, etc.) to combine those facts and deduce the inevitable result of the race. If we did, we would not do horse racing, because the results would be known in advance and no longer interesting.

What you do when you say "Horse number 3 is gonna win!" is build a mental model of the race. The model combines a subset of known facts with your understanding of what is going on, and based on the model, you draw some conclusions. The model could be as simple as "horse number 3 always wins", or very complex. When we want to be more rigorous, we use scientific models of the world. Those models could be physical models, mathematical models, simulation models, etc. What those models have in common is that they simplify and approximate reality, so as to let us draw conclusions, understand, predict, or simulate something.

Statistical models are one of the kinds of such models. They are mathematical representations of the problem that use the data. In your question, you gave examples of linear models such as linear regression or generalized linear models, but there are many more models than linear ones. Also, you seem to be assuming that the statistical model assumes in advance the exact mathematical representation of the problem. That is not the case! For example, a regularized regression could shrink some of the parameters to zeros, effectively removing them from the equation, so the resulting function would make use of a different set of features than we assumed in our model. Polynomial regression, like a neural network, is a universal approximator, so the model assumes that the functional representation of the data could take any form. Logistic regression is the simplest kind of neural network, so even if we agreed with your definition of a model, neural networks would fit the definition. Finally, $k$NN, random forest, and other machine learning models can be written down as mathematical functions, so they would also be models in the sense of being mathematical functions representing the data.
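As a minimal sketch of the regularization point (simulated data of my own, using scikit-learn, purely for illustration): a lasso fit can set some coefficients exactly to zero, so the fitted function ends up using fewer features than the equation we originally wrote down.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Only the first two features actually matter in the simulated data
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(lasso.coef_, 2))   # most coefficients are shrunk exactly to 0
```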

If you prefer, instead of saying that in $$y = f(\mathbf{x})$$ $f$ is a "model", use the word "curve": you would say that you fitted a linear regression "curve" to the data, a random forest "curve", a neural network "curve", etc. However, we usually call $f$ a (statistical or machine learning) model. You would not benefit much from avoiding the term, though, because it would make communication harder: each time, you would need to explain what you mean in a place where everyone else simply says "model".

Regarding your concerns:

  • Intuitively, in the real world we do not know what the relationship between label and data is. So why bother creating a relationship between them?

In some cases we do know something about it, and then we can use models that let us be explicit about it; in other cases we don't, so we use things like non-parametric models.

  • In no part of modern machine learning training/validation/testing routine do we require the assumption of a clear relationship between $x_i$ and $y_i$. It literally does not provide us with any additional information.

I don't agree. If you are using some machine learning model, you are making the assumption that there is some relation between $x$ and $y$. If there is no such relation, your model cannot learn anything, and the only thing it can do is overfit the noise, or predict something like the global average or the most frequent class for all observations.

Moreover, if you can use a model that assumes some specific relation, the model may require way less data to learn. For linear regression, a few hundred samples may be more than enough, while for a neural network thousands may not be enough.
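A toy illustration of that sample-size point (my own simulation, not a rigorous benchmark): when the true relationship really is linear, a linear model fitted on a small sample tends to generalize better than a flexible neural network fitted on the same sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)

def make_data(n):
    X = rng.normal(size=(n, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=1.0, size=n)
    return X, y

X_train, y_train = make_data(100)       # small training set
X_test, y_test = make_data(10_000)      # large test set

lin = LinearRegression().fit(X_train, y_train)
net = MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=5000,
                   random_state=0).fit(X_train, y_train)

print(mean_squared_error(y_test, lin.predict(X_test)))   # close to the noise floor (~1.0)
print(mean_squared_error(y_test, net.predict(X_test)))   # typically noticeably worse
```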

  • In modern deep supervised machine learning, the concept of a model is not even remotely mentioned. For example, it doesn't seem to make sense to say things such as "what's the statistical model of GoogLeNet", or "what's the statistical model of LSTM". I mean, what is the functional form of $m$ and noise assumption on $\epsilon_i$, $y_i = m(x_i) + \epsilon_i$ for things like DenseNet or U-Net?

GoogLeNet, LSTM, etc. are the models themselves. You incorrectly assume that all models have a simple form like $y = m(x) + \epsilon$; that is not the case. Unless your question is "Are we forced to use linear models?", in which case the answer is obviously "no": even old-school statistics does not limit itself to such models.

Tim
  • 138,066
  • 3
    Tim is spot on. There are many reasons for statistical models and we need to keep the difference between models and machine learning straight as I have attempted to do here. The default additivity assumption in most statistical models means, as Tim said, that the sample size required may be a fraction (often 0.1) of what's needed for machine learning. Plus with models you get uncertainty measures automatically, and you can handle censored and truncated data easily. Note also that the OP omits decades of statistical advances in models (splines, etc.). – Frank Harrell Jul 12 '23 at 11:37
  • 2
    See Modern Modeling Techniques are Data Hungry. Here "modeling" is used in a more general sense than statistical models. – Frank Harrell Jul 12 '23 at 11:46
8

It seems like your question and comments come from the same place where the sentiment—"machine learning is just curve fitting"—comes from. The sentiment is true at a low level. But statistical modeling is a useful concept because it systematically tells you what curves you should be looking for. Statistical modeling is a language to turn assumptions about data into objectives that computers can solve using observations. Sure, these objectives are well known for tasks such as plain classification or plain regression. But new tasks, new variants of tasks, or new ways of mapping tasks to existing statistical models, come up all the time. And creative specifications of the statistical model happen to be fruitful for cutting-edge machine learning research. More on that later.

Basic idea

Before demonstrating the usefulness of statistical modeling, let's roughly define it. The other answers correctly tell you to abandon the overly-constrained definition of $Y = m(x) + \epsilon$. So what's a better, more generic definition? A statistical model is a random variable, representing output data, which is a function of input data; the model specifies a random process by which input data is transformed into output data. Under this definition, unlike $Y = m(x) + \epsilon$, you don't need to get hung up on explicitly defining the distribution of random errors. (Answers to this other CV question speak a bit more to that.) For example, here's a specification of the statistical model underlying logistic regression:

$$ m(x) = \beta_0 + \beta_1 x \\ f(x) = \sigma(m(x)) \\ Y_i \sim \text{Bernoulli}(f(x_i)) \text{ independent}. $$
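Concretely, this specification is a recipe for how the labels come about. A minimal simulation sketch (my own, with made-up parameter values):

```python
import numpy as np
from scipy.special import expit       # the sigmoid, sigma(.)

rng = np.random.default_rng(3)

# The model as a data-generating process: Y_i ~ Bernoulli(sigma(beta0 + beta1 * x_i))
beta0, beta1 = -0.5, 2.0              # made-up "true" parameters
x = rng.normal(size=1000)
f = expit(beta0 + beta1 * x)          # f(x_i) = Pr(Y_i = 1)
y = rng.binomial(1, f)                # draw the labels according to the model
```

Prediction then amounts to recovering $\beta_0, \beta_1$ from observed $(x_i, y_i)$ pairs, which is exactly what the likelihood machinery below does.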

Two questions you may have:

  1. If I believe the output data is a deterministic transformation of the input, why would I model it as if it's random?

  2. How did specifying $Y_i \sim \dots$ make any progress on the problem of predicting new data? What's the need for this formalism?

Re the first question: a lot can be said. I'll leave it to you to read these sections of these ML textbooks (which are freely available online):

  • Section 3.1 of Deep Learning1

    In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a complex rule.

  • Section 1.2.1.5 of Probabilistic Machine Learning: An Introduction2

    In many cases, we will not be able to perfectly predict the exact output given the input, due to lack of knowledge of the input-output mapping (this is called epistemic uncertainty or model uncertainty), and/or due to intrinsic (irreducible) stochasticity in the mapping (this is called aleatoric uncertainty or data uncertainty).

Now that we've established that a random model is realistic for the purposes of prediction, let's establish that $Y_i \sim \dots$ represented progress on the problem.

Go back to the logistic regression model above. To make new predictions, we just need to "fill in" the values of $\beta_0$ and $\beta_1$ using training data. A principled way to do that is to estimate them via maximum likelihood estimation (MLE). (Section 5.5.2 of Deep Learning1 summarizes some of the theoretical justification for MLE.) It's in expressing the likelihood function that the formalism $Y_i \sim \dots$ is required:

$$ \begin{align*} \mathcal{L}(\theta \: | \: Y_1 = y_1, \dots, Y_n=y_n) &= \prod_{i=1}^{n} {\Pr}_\theta(Y_i = y_i) & \text{independence} \\ &= \prod_{i=1}^{n} {\Pr}_\theta(Y_i = 1)^{y_i} (1 - {\Pr}_\theta(Y_i = 1))^{(1 - y_i)} & \text{Bernoulli} \\ &= \dots & \text{take log, simplify.} \end{align*} $$
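Taking the negative log of that likelihood gives exactly the binary cross-entropy loss a machine learning library would minimize. A minimal numpy check of that identity (made-up labels and predicted probabilities):

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1, 0])              # observed labels y_i
f = np.array([0.9, 0.2, 0.6, 0.8, 0.1])    # model outputs f(x_i) = Pr_theta(Y_i = 1)

# Negative log of the likelihood derived above (independence + Bernoulli)
nll = -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))

# ... which is exactly the (summed) binary cross-entropy loss an ML library minimizes
bce = log_loss(y, f, normalize=False)

print(np.isclose(nll, bce))   # True: maximizing likelihood == minimizing cross-entropy
```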

The theoretical simplicity and empirical effectiveness of this paradigm—

  1. model output as a random variable
  2. minimize negative log-likelihood plus regularization

—is at the core of the answer to your question. It's a mistake to take this idea for granted. As Georg M. Goerg put it in their comment:

once you put on your probability distribution hat, you will find that it is much easier (and has a much higher success rate) to find "natural" loss functions for your ML / decision-making problems compared to going by "intuition" alone.

In that comment thread, you claim that one can arrive at the mean squared error objective by appealing to Euclidean distance. Fair enough. But what objective will you use when the output is the time until an event? What objective will you use when the output depends on previous outputs? What objective will you use when the output is a rate, or a category? I'm sure there are non-statistical-looking answers to each of these questions. And perhaps some of them result in a more predictive model! But isn't the statistical modeling approach just so easy? You translate your assumptions about how the data were generated into a probabilistic process, and then maximize the likelihood of this process. Such extensibility is a sign of a good framework.

Examples from deep learning

Perhaps that rough explanation wasn't convincing enough. You may be more convinced if it's demonstrated that many novel deep learning approaches explicitly and intentionally follow the paradigm:

  1. model output as a random variable
  2. minimize negative log-likelihood plus regularization.

GPT

Model: let $U$ be a random variable indicating the index of a token / its input ID.

$$ \mathbf{u}_i = (u_{i-k}, u_{i-k+1}, \dots, u_{i-1}) \\ \mathbf{h}_i = \text{Transformer}(\mathbf{u}_i) \\ f(\mathbf{u}_i) = \text{softmax}(W \mathbf{h}_i) \\ U_i \sim \text{Categorical}(f(\mathbf{u}_i)). $$

Likelihood: see equation (1) of the first GPT paper3 for the likelihood objective, which comes from applying the probability chain rule.

Regularization: the $\text{Transformer}$ typically includes dropout, and the optimizer typically includes weight decay.
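(A minimal sketch, with toy numbers rather than the actual GPT code, of how this Categorical model turns into the familiar next-token cross-entropy objective:)

```python
import numpy as np

rng = np.random.default_rng(4)

vocab_size, seq_len = 10, 5
logits = rng.normal(size=(seq_len, vocab_size))    # stand-in for W h_i from the transformer
targets = rng.integers(vocab_size, size=seq_len)   # the actually observed next tokens u_i

# softmax over the vocabulary gives the Categorical parameters f(u_i)
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Negative log-likelihood of the Categorical model = next-token cross-entropy loss
nll = -np.log(probs[np.arange(seq_len), targets]).sum()
print(nll)
```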

ChatGPT

The novelty of InstructGPT4 (a very close cousin of ChatGPT) is in performing Reinforcement Learning with Human Feedback (RLHF)5. An important piece of RLHF's reward modeling subproblem is formulating a loss function based on observations of preferences. Currently, most alignment work assumes preferences are binary.

Model: let $Z$ be a random variable indicating that a language model's sampled response $\mathbf{y}^{(w)}$ is preferred to an alternate sampled response $\mathbf{y}^{(l)}$ for a prompt $\mathbf{x}$. Apply the Bradley-Terry model of binary preference relations. (See section 2.2.3 of the RLHF paper.)

$$ \mathbf{h}(\mathbf{x}, \mathbf{y}) = \text{Transformer}((\mathbf{x}, \mathbf{y})) \\ r(\mathbf{x}, \mathbf{y}) = \mathbf{w}^T \mathbf{h}(\mathbf{x}, \mathbf{y}) \\ f(\mathbf{x}, \mathbf{y}^{(w)}, \mathbf{y}^{(l)}) = \sigma \big( r(\mathbf{x}, \mathbf{y}^{(w)}) - r(\mathbf{x}, \mathbf{y}^{(l)}) \big) \\ Z_i \sim \text{Bernoulli} \big( f(\mathbf{x}_i, \mathbf{y}^{(w)}_i, \mathbf{y}^{(l)}_i) \big) \text{ independent}. $$

Likelihood: see equation 2 of this paper6 (for a cleaner expression which matches my notation).

Regularization: when training the language model according to the reward model, there's a penalty for deviating too far from its initial distribution.
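(For concreteness, a minimal sketch, with made-up reward values rather than the paper's code, of the reward-model loss that this Bernoulli/Bradley-Terry specification implies:)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Scalar rewards r(x, y) for the preferred and rejected responses in a batch,
# as they would come out of the reward model's final linear head (made-up values).
r_preferred = np.array([1.2, 0.3, 2.1])
r_rejected  = np.array([0.4, 0.9, 1.5])

# Z_i ~ Bernoulli(sigma(r_w - r_l)) with all observed Z_i = 1 (the preferred one won),
# so the negative log-likelihood is the usual pairwise preference loss:
loss = -np.log(sigmoid(r_preferred - r_rejected)).mean()
print(loss)
```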

As new forms of human feedback are experimented with, new loss functions will need to be formulated. Such a task is easier to tackle when you specify a statistical model for how that feedback came about.

Diffusion models

This one takes much more notation, so I'm just going to point you to the concise summary in Section 2 of this paper7. Observe that the specification of the statistical model is quite contrived and creative.

Text embeddings

The top 4 (at the time this post was written) state-of-the-art approaches for embedding text rely on the same sort of loss function described in this paper8.

Model: given a batch of $n$ texts, let $Z_i$ indicate which text in the batch $\mathbf{x}_i$ is similar to. (Below, $\tau$ is a hyperparameter often referred to as temperature.)

$$ \mathbf{h}(\mathbf{x}) = \text{Transformer}(\mathbf{x}) \\ s(\mathbf{x}, \mathbf{x}') = \frac{\mathbf{h}(\mathbf{x})^T \mathbf{h}(\mathbf{x}')}{||\mathbf{h}(\mathbf{x})||_2 ||\mathbf{h}(\mathbf{x}')||_2} \\ s^{(k)}_i = s(\mathbf{x}_i, \mathbf{x}_k) / \tau \\ \mathbf{p}_i = \text{softmax}(s^{(1)}_i, s^{(2)}_i, \dots, s^{(n)}_i) \\ Z_i \sim \text{Categorical}(\mathbf{p}_i). $$

Likelihood: see, e.g., section 2.2 of the InstructOR paper9.
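(A minimal numpy sketch, with random embeddings of my own rather than any real model, of how this Categorical model over the batch becomes the usual in-batch contrastive loss:)

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, tau = 8, 32, 0.05

Q = rng.normal(size=(n, d))   # stand-in embeddings h(x_i) of the "query" texts
D = rng.normal(size=(n, d))   # embeddings of the paired "positive" texts (same index = pair)
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
D /= np.linalg.norm(D, axis=1, keepdims=True)

S = (Q @ D.T) / tau           # cosine similarities s(x_i, x'_k) / tau
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)          # p_i = softmax over the batch

targets = np.arange(n)        # Z_i: each query is "similar to" its own paired text
loss = -np.log(P[np.arange(n), targets]).mean()   # Categorical negative log-likelihood
print(loss)
```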

I wanted to highlight text embedding models because, at first glance, it seems like a purely geometric problem: all we need to do is push similar texts closer together in embedding space and dissimilar texts farther apart. Indeed, many approaches to metric learning involve purely geometric loss functions, e.g., contrastive loss, triplet loss. If we are certain about the similarities and dissimilarities of texts in each batch, how could statistical modeling be warranted? It's hard to pinpoint a reason. I just find it relevant to your question that it is warranted, at least empirically. (Note: if the dissimilar texts are crudely randomly sampled from a pool of texts, then it's clear that modeling similarity as random is more realistic. See word2vec.10)

Object detection

I'm getting kind of tired so just read section 3 of this paper11.

Why doesn't {insert architecture paper} specify a statistical model?

There are many highly impactful ML papers which are focused purely on $m(x)$—the model's architecture. These papers develop computationally efficient ways to flow signal from the input $x$ to, ideally, arbitrary outputs. They don't need to specify statistical models because they seek to outperform other architectures on benchmark tasks, for which appropriate loss functions have already been designed.

Even if the paper includes a novel loss function, it's not always useful to verbosely specify a statistical model. More on that now.

Important disclaimer

(Treat the thoughts below as opinions.)

The statistical modeling paradigm—

  1. model output as a random variable
  2. minimize negative log-likelihood plus regularization

—is not a rule in ML, or anything really. Don't stress too much about not having a crystal-clear probabilistic interpretation of your loss function. If summing up a few different loss functions seems reasonable, then try it. If re-weighting terms in a loss function could counteract a problem you've observed, then try it. If it works, it works. The statistical modeling paradigm is useful because it's easy and fast to start with something principled, and then iterate on it or move on. There are bigger fish to fry when you're working on an ML problem. The majority of ML research is not focused on specifying statistical models.

And there are many examples in deep learning where it's not as useful to connect the loss function to a statistical model. A notable example is that training a Generative Adversarial Network is fruitfully thought of in game-theoretic terms.12

But for decades past, and probably some time to come, many of the big ideas in deep learning contain an instance of the high-level recipe: specify statistical model, maximize likelihood.

References

  1. Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

  2. Murphy, Kevin P. Probabilistic Machine Learning: An Introduction. MIT Press, 2022.

  3. Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).

  4. Ouyang, Long, et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35 (2022): 27730-27744.

  5. Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." Advances in Neural Information Processing Systems 30 (2017).

  6. Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." arXiv preprint arXiv:2305.18290 (2023).

  7. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.

  8. Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. "Representation learning with contrastive predictive coding." arXiv preprint arXiv:1807.03748 (2018).

  9. Su, Hongjin, et al. "One embedder, any task: Instruction-finetuned text embeddings." arXiv preprint arXiv:2212.09741 (2022).

  10. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems 26 (2013).

  11. He, Yihui, et al. "Bounding box regression with uncertainty for accurate object detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

  12. Goodfellow, Ian, et al. "Generative adversarial nets." Advances in Neural Information Processing Systems 27 (2014).

chicxulub
  • 1,420
1

I think the distinction you are making is between 'data models' and 'algorithmic' models, in the terminology of Breiman's famous paper, "Statistical Modeling: The Two Cultures".

It's a well-known distinction between the black-box modelling of ML and the traditional statistical models of scientific analysis.

You are right that fewer assumptions need to be made if one is interested in predictive accuracy alone. Nevertheless you still have to make assumptions to conclude that good results on your test set will translate to good results on new data.

Similarly, the assumption of additive Gaussian noise implies optimality properties for the MSE criterion. If instead you assumed your data had a different error distribution, then a different objective function would be optimal (e.g., absolute deviation is recommended for heavy-tailed data). Similarly, if the noise is multiplicative, you would likely take logs before using MSE.
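As a small illustration of that point (my own simulation, using scikit-learn's QuantileRegressor for the absolute-deviation fit): with heavy-tailed noise, minimizing absolute deviation recovers the underlying line much more reliably than minimizing squared error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, QuantileRegressor

rng = np.random.default_rng(6)

n = 500
x = rng.uniform(-2, 2, size=(n, 1))
y = 1.0 + 3.0 * x.ravel() + rng.standard_t(df=1.5, size=n)   # heavy-tailed (t) noise

ols = LinearRegression().fit(x, y)                            # minimizes squared error
lad = QuantileRegressor(quantile=0.5, alpha=0.0,
                        solver="highs").fit(x, y)             # minimizes absolute deviation

print(ols.intercept_, ols.coef_)   # can be badly thrown off by the outliers
print(lad.intercept_, lad.coef_)   # typically close to (1.0, 3.0)
```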

seanv507
  • 6,743
  • Semiparametric models largely solve the distributional assumption problem you described. And I'm one of the few statisticians who finds that Breiman's once timely and brilliant paper hasn't aged extremely well due to the fact that statistical models are much more flexible than when he wrote the paper. – Frank Harrell Jul 12 '23 at 11:48
  • Thank you for this answer. It would be great if you could define in clearer terms the difference between data and algorithmic models. – Curaçao Hajek Jul 14 '23 at 20:11
  • Another point I would like to raise is that the objective function and its optimality do not necessarily have to do with the error distribution. Linking the Gaussian with MSE is not natural (it requires a derivation); but using MSE as an error metric by itself IS natural: it is simply the distance between the true label and the predicted label in Euclidean space. This application does not require the concept of a Gaussian or even a probability distribution (which is always unknown). – Curaçao Hajek Jul 14 '23 at 20:13
  • In other words, I am not sure about the error distribution of my data (can this even be known?? My data is images of dolphins, octopuses, and orcas - what's the error distribution?). However, the objective function is a distance between the target and the prediction in a geometric space. The appropriateness of the geometric space implies its optimality. At least this is what I think. – Curaçao Hajek Jul 14 '23 at 20:14
  • 4
    @CuraçaoHajek The geometry you refer to is implied by the probabilistic nature/properties of your space. Why do you think it's "natural" to use square loss? That's not true for, say, binary labels or categorical truth; the geometry of the probability space gives you a more "natural" way to measure loss (cross-entropy). It seems to me you imply that people just came up with loss functions intuitively and only through 'experiments' were they shown to work. That is somewhat backwards. MSE is "natural" because a Gaussian error distribution is a good proxy for Euclidean space, not the other way around. – Georg M. Goerg Jul 15 '23 at 14:19
  • 4
    @CuraçaoHajek Or from a productive/success view: once you put on your probability distribution hat, you will find that it is much easier (and has a much higher success rate) to find "natural" loss functions for your ML / decision-making problems compared to going by "intuition" alone. – Georg M. Goerg Jul 15 '23 at 14:23
0

I think I may see part of the misunderstanding here:

"Intuitively, in the real-world we do not know what is the relationship between label and data. So why would bother creating a relationship between them?"

because a key goal of the analysis may be to understand or quantify the relationship between them, rather than just to make good predictions. We want to understand the data generating process.

When we use a linear model (they are used in machine learning as well), it isn't because we think that the real-world data follow an exactly linear law, just that it is a reasonable approximation that we can understand. This is not unreasonable: in science and engineering we often use Taylor series expansions of non-linear relationships, and a linear model is just a first-order Taylor series approximation. If that approximation is good, then the model can help us understand the data. All models are approximations, hence "all models are wrong, but some are useful".

However, as you point out, statisticians also use non-linear and more opaque non-parametric methods, where again there is little or no assumption made about the true form of the relationship between inputs and outputs.

If you are looking for a clear distinction between machine learning models and statistical models, you won't find one, because it doesn't exist. Almost all of machine learning is a branch of statistics. If you want a machine learning book that emphasises this, I can recommend the works of Chris Bishop and Kevin Murphy.

Dikran Marsupial
  • 54,432