29

I am wondering why machine learning needs a lot of data compared to statistical inference. In statistics we can draw inferences from a small amount of data, but in machine learning everybody says we need a lot of data. Why does machine learning need so much more data than statistical inference?

xabzakabecd
  • 21
    Machine learning models typically have thousands to millions of free parameters (or more). Doing statistical inference on models of that size would require the same amount of data as machine learning techniques. – J. Delaney Apr 20 '22 at 17:44
  • 1
    I think you have it backwards. The advantage ML gives you is the ability to accommodate vast amounts of data, and to make inferences from it. Since dealing with so much data is impractical for "classical" statistical inference, we try to limit the number of free parameters (i.e. we make assumptions that may be entirely wrong). If you try to validate your statistical inferences against the same vast amounts of data, you'll generally see much worse results than from ML. – Luaan Apr 21 '22 at 09:17
  • 7
    That is misleading. In the low signal:noise case (i.e., excluding visual and audio pattern recognition, language translation, etc.) comparisons of statistical models and machine learning are not showing clear victories for machine learning, and the sample size required for machine learning is far higher. Most of the comparisons in this setting (e.g., health outcomes) that have been found to favor machine learning have assumed that statistical models must assume linear effects of predictors, which is quite wrong. More at https://fharrell.com/post/stat-ml – Frank Harrell Apr 21 '22 at 11:19
  • @J.Delaney Think about linear regression in statistics as well as in machine learning. I never heard that a lot of data is needed when doing linear regression in statistics but not in machine learning. Even with an equal number of parameters, machine learning people say you need a lot of data to be more accurate. – xabzakabecd Apr 21 '22 at 13:51
  • 5
    The main explanation is that ML effectively allows for tons of interactions, and in the best case an interaction takes 4 times the sample size to estimate with the same precision as a main effect. In stat models we are selective about inclusion of interactions. – Frank Harrell Apr 21 '22 at 14:53
  • 1
    @oceanus "Machine Learning" is a very broad and vague term. Linear regression is not that different in ML vs standard statistics, so it's hard to tell what the things you heard actually refer to – J. Delaney Apr 21 '22 at 16:06
  • @FrankHarrell You seem to be using "interaction" in the specific sense of an interaction effect. I don't think that yields a 'main' general explanation, because many of the kinds of non-linearity used in ML models are not equivalent to interaction effects. – Galen Apr 21 '22 at 18:26
  • 1
    Nvidia's new AI learns to edit photos with only 16 examples (from last week). The parts of each input photo needed to be labeled by hand to reduce the search space. – BlueRaja - Danny Pflughoeft Apr 21 '22 at 23:49
  • 2
    @BlueRaja-DannyPflughoeft and uses StyleGANs pre-trained on millions of images :) – Firebug Apr 22 '22 at 08:34

6 Answers

35

All other things being equal (and when are they?), machine learning models require similar quantities of data to statistical models. In general, statistical models tend to make more assumptions than machine learning models, and it is these additional assumptions that give you more power (assuming they are true/valid), which means that smaller samples are needed to obtain the same confidence. You can think of the difference between statistical and machine learning models as the difference between parametric and non-parametric models.

Complex models with many parameters (which are more prevalent in machine learning, e.g. deep neural networks) do require more data, but this has to do with the parameters and not with the models being "machine learning" as such. If you built a complex statistical model with many interactions and polynomial terms you would similarly need a large amount of data to estimate all the parameters (unless you are Bayesian... then you do not even need data!).
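
As a rough, hypothetical illustration of this point (my own sketch, not part of the answer; the simulated data and model choices are arbitrary): fit a 2-parameter linear model and a many-parameter random forest to data whose true relationship is linear. At small sample sizes the model whose assumptions match the truth wins; the flexible model needs considerably more data to catch up.

```python
# Minimal sketch: a few-parameter statistical model vs. a many-parameter ML model
# on data generated from a simple linear truth (illustrative assumption).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_data(n):
    # True relationship: y = 2x + noise
    x = rng.uniform(-3, 3, size=(n, 1))
    y = 2.0 * x[:, 0] + rng.normal(scale=1.0, size=n)
    return x, y

x_test, y_test = make_data(5000)

for n in (20, 200, 2000):
    x_train, y_train = make_data(n)
    lin = LinearRegression().fit(x_train, y_train)                    # 2 parameters
    rf = RandomForestRegressor(random_state=0).fit(x_train, y_train)  # many effective parameters
    print(n,
          round(mean_squared_error(y_test, lin.predict(x_test)), 3),
          round(mean_squared_error(y_test, rf.predict(x_test)), 3))
```

With this kind of setup the linear model is typically near the noise floor already at n = 20, while the forest needs far more data to get there.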

user2974951
  • And of course, it should be noted that wrong assumptions are at fault for a lot of broken statistical models. It doesn't matter how good your data is if the problem isn't in the data. You can of course do machine learning the same way - inputting your own assumptions into the models, and getting away with needing a lot less training data. But as it turns out, when you actually put in the huge amounts of data, you quickly find pretty much all of the assumptions are wrong - though obviously, not all "wrongs" are equal. The advantage of ML is its ability to accommodate such vast amounts of data. – Luaan Apr 21 '22 at 09:11
  • 1
    That is not very helpful. The question is whether the assumptions that statistical models make are reasonable. The key assumption is additivity of effects (except for pre-defined interactions), which turns out to be remarkably good in many datasets that are not of the image recognition variety. It is ironic that in the highest dimensional cases (whole genome analysis) one must revert back to statistical models such as lasso because the problem isn't tractable without assuming additivity. – Frank Harrell Apr 21 '22 at 11:14
  • @FrankHarrell I am not sure that the LASSO is not also an ML model as Laplace priors were already being used with neural networks at that time (and a linear model would be a special case). – Dikran Marsupial Apr 21 '22 at 11:33
  • lasso is a statistical model. Using Laplace priors in ML is related to lasso but lasso assumes predominant additivity. – Frank Harrell Apr 21 '22 at 12:00
  • 1
    @FrankHarrell How does LASSO differ from a single layer neural network with a Laplace prior? AFAICS the models are structurally the same and optimise the same criterion. – Dikran Marsupial Apr 21 '22 at 13:38
  • @DikranMarsupial (+1) Also check The Highly Adaptive Lasso Estimator; it is really good. – usεr11852 Apr 21 '22 at 14:47
  • 1
    If you want to restrict to single layer (not sure why) then you have a point. – Frank Harrell Apr 21 '22 at 14:52
  • 5
    My point is that there is no real distinction between good practice in statistical models and machine learning models, it is mostly a difference in terminology (they tend to differ most in their bad practices - class imbalance being a good example). I did say that the LASSO in statistics is a special case of a Laplace prior in neural nets. Back in the day we used skip-layer connections so that if you had a linear problem the Laplace prior is likely to prune it back to a linear model. The point was that it isn't "revert[ing] back to statistical models" - the concept existed in ML already. – Dikran Marsupial Apr 21 '22 at 14:57
  • 1
    The reason for restricting to a single layer here would be exactly the reason you gave - tractability. – Dikran Marsupial Apr 21 '22 at 15:00
  • @FrankHarrell What is "predominant additivity"? – Galen Apr 21 '22 at 18:52
  • 1
    It means that for most of the predictors we assume they are separable on a suitably chosen transformed Y scale. For example for binary Y we model additive nonlinear effects of predictors on the logit (log odds) scale without allowing the majority of predictors to change each others' effect, e.g., without multiplying them together so as to allow for interaction. – Frank Harrell Apr 21 '22 at 19:14
  • @user2974951 What do you mean by "unless you are Bayesian... then you do not even need data!"? Could you explain that a bit more? – xabzakabecd Apr 22 '22 at 02:24
  • @oceanus That was a joke (implying that the priors alone are enough to get the posterior... technically). – user2974951 Apr 22 '22 at 05:39
22

Well, you could do inference with a small amount of data. We just have concepts like statistical power to tell us when our results would be reliable and when they would not be.

In general, lots of data is needed in machine learning to overcome the variance in estimators/models. Trees, as an example, are incredibly high variance estimators. The only real way to combat that is to add more data since the variance shrinks proportional to $1/n$.
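
A minimal sketch of what "variance of the estimator" means here (my own, with an arbitrary simulated signal; I cap the tree depth so the effect is easy to see): refit a tree on many independently drawn training sets and look at how much its prediction at one fixed point moves around. The spread shrinks as $n$ grows.

```python
# Minimal sketch: variance of a (depth-limited) regression tree's prediction at a
# fixed query point, estimated over many independently drawn training sets.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
x_query = np.array([[0.5]])

def tree_prediction(n):
    # Fresh training set of size n: noisy sine curve (illustrative choice).
    x = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.3, size=n)
    return DecisionTreeRegressor(max_depth=3).fit(x, y).predict(x_query)[0]

for n in (50, 200, 800):
    preds = [tree_prediction(n) for _ in range(300)]
    print(n, round(np.var(preds), 4))   # prediction variance falls as n grows
```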

  • What does $1/\sqrt{n}$ mean? – xabzakabecd Apr 20 '22 at 16:24
  • Here, $n$ is the number of samples. – Arya McCarthy Apr 20 '22 at 16:26
  • 1
    I understand what you meant by $ n $ but I do not well understand your saying "since the variance shrinks proportional to $ 1/\sqrt{n} $". Could you stretch your explanation a bit more for that line with the number? – xabzakabecd Apr 20 '22 at 16:28
  • 1
    The variance of what? The estimated variance of a population is more-or-less independent of sample size. The variance in your estimate of the mean of a parameter goes as $1/n$. – ProfRob Apr 21 '22 at 13:59
  • 2
    +1 " We just have concepts like statistical power " or look at the predictive uncertainty of the model, which often indicate that the model isn't telling you anything useful about the problem. – Dikran Marsupial Apr 21 '22 at 15:04
  • 1
    @ProfRob Variance of a machine learning model/estimator, not variance in data. The more complex a model is, the more it can learn, but the more data it needs so it doesn't memorize the data sets it sees. That would lead to low generalization capability of the model, on previously unseen data. A complex model is said to have high variance, and a simple model to have high bias. For example, have a look at https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff or https://datascience.stackexchange.com/questions/37345/what-is-the-meaning-of-term-variance-in-machine-learning-model. – ivanbgd Apr 21 '22 at 16:14
  • The links aren't that helpful (to me). The first suggests the variance is (as I would expect) the (difference between prediction and truth)^2 - and I can't see why that would depend on $n$ at all. The second link has a highly-rated, accepted answer that doesn't clearly define the variance (even though that is the question asked) . From what I can gather, it is being used as a synonym for the precision^2 of the prediction on unseen data? Why does that depend on $n^{-1/2}$? Or should I ask a separate question... – ProfRob Apr 21 '22 at 17:10
  • @ProfRob "The variance of what" -- the variance of the prediction. You have to consider the prediction as a random variable conditional on the training set $E[Y\vert X_\mathcal{T}$]. Were I to get different training sets, then my prediction would change. If training data are small then this variance is quite large. More data makes the variance smaller. – Demetri Pananos Apr 21 '22 at 18:17
  • @ProfRob If you have a complex and powerful ML model and you feed it little data during training, it will memorize the data set by heart, because it's powerful enough. If you feed it another, different training set, it will memorize that by heart again. It varies with the data set. That's why it has high variance. That's bad when it has to predict something on data it hasn't seen before. The good thing is that it can learn complex dependencies, but you need to give it a huge amount of data so it cannot learn it by heart. Then it will make fewer errors when predicting with new data. – ivanbgd Apr 21 '22 at 18:21
  • @ProfRob Low or high bias or variance are properties of ML models and not of data in this case. High bias models are simple and can't learn complex hypotheses, even with a lot of data. Having a complex model is a good thing, but you need to give it a lot of training data so it doesn't vary with data set. You actually want to reduce its initial high variance. There are techniques like regularization that deal with that if there's not enough data, but that's a different topic. It's generally better to have a powerful model and a lot of data. More data will reduce the inherent model's variance. – ivanbgd Apr 21 '22 at 18:28
  • @ivanbgd thanks for those comments - useful. – ProfRob Apr 21 '22 at 19:07
  • @ProfRob You're welcome. – ivanbgd Apr 21 '22 at 20:23
20

Machine learning does not require large amounts of data, it is just that the current bandwagon is for models that work on big data (mainly deep neural networks, which have been around since the 1990s, but before that it was SVMs and before that "shallow" neural nets), but research on other forms of machine learning has continued. My own personal research interests are in model selection for small data, which is far from a solved problem, just not in fashion. Another example would be Gaussian Processes, which are very good where a complex (non-linear) model is required, but the data are relatively scarce.
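
For instance (a minimal sketch of my own, not the author's code; the kernel and data are arbitrary choices), a Gaussian Process regressor can be fit to a dozen noisy observations and still return both a prediction and an honest uncertainty about it:

```python
# Minimal sketch: Gaussian Process regression on a very small data set.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(2)
x_train = rng.uniform(0, 10, size=(12, 1))                      # only 12 observations
y_train = np.sin(x_train[:, 0]) + rng.normal(scale=0.1, size=12)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(x_train, y_train)

x_new = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gp.predict(x_new, return_std=True)                  # predictive mean and std. dev.
print(np.round(mean, 2), np.round(std, 2))
```

The predictive standard deviation is what keeps a small-data model honest: it is large wherever the 12 points say little.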

It is a pity that there is so much focus on deep learning and big data, as it means that a lot of new practitioners are unaware of research that was done 20 or more years ago that is still valid today, and as a result they are falling into many of the same pitfalls that we found back in the day. Sadly, ML and AI go through these cycles of hype and doldrums.

At the end of the day though, ML is just statistics, but a more computationally focussed branch of statistics.

Dikran Marsupial
  • 5
    A bit harsh... (but +1) To play devil's advocate here: GPs if anything fail with larger amounts of data as the kernel matrix gets unwieldily large. (Yeah, I know sparse kernels exist.) – usεr11852 Apr 21 '22 at 14:44
  • 8
    @usεr11852 indeed, they are a good example though that not quite all of ML is currently obsessed with big data and deep learning! I started in ML in 1990, so I have seen a few hype cycles first hand ;o) – Dikran Marsupial Apr 21 '22 at 15:40
  • 1
    Yep! I hate how the current approach is "just shove it into NN, report good accuracy, no knowledge extracted whatsoever". Probably over 90% of articles are pure hype; in 2 years' time all meaningful content in them besides "ML could be applied to problems in X" erodes and becomes irrelevant with a new off-the-shelf general-purpose model. – Lodinn Apr 23 '22 at 19:09
  • 1
    One-shot learning algorithms typically don't require large amounts of training data, for example. – Anderson Green Nov 15 '22 at 17:31
6

Machine learning (often) needs a lot of data because it doesn't start with a well-defined model and uses (additional) data to define or improve the model. As a consequence there are often a lot of additional parameters to be estimated, parameters or settings that are already defined a priori in non-machine-learning methods.

  • Statistical inference, when it requires only a little data, is often performed with some model that is already known/defined before the observations are made. The learning has already been done.

    The goal of the inference is to estimate the few missing parameters in the model and verify the accuracy of the model.

  • Machine learning often starts with only a very minimal model, or not even a model but just a small set of rules from which a model can be created or selected.

    For instance, one learns which variables are actually suitable to make good predictions or one uses a flexible neural network to come up with a function that fits well and makes good predictions.

    Machine learning does not just search for a few parameters in an already fixed model. Instead it is the model itself that is being generated in machine learning. For that you need additional data.


Sometimes it is also the other way around: a lot of data needs machine learning. That is the situation with lots of variables but without a well-defined model.
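
A minimal sketch of the distinction (my own, with an arbitrary cubic example): here the "model" (the polynomial degree) is not fixed a priori but selected from the data by cross-validation, and that selection step is itself part of what the data has to pay for.

```python
# Minimal sketch: the model form (polynomial degree) is chosen from the data
# by cross-validation rather than being specified in advance.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
n = 60
x = rng.uniform(-2, 2, size=(n, 1))
y = x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=0.5, size=n)      # true model: cubic

pipe = make_pipeline(PolynomialFeatures(), LinearRegression())
search = GridSearchCV(pipe, {"polynomialfeatures__degree": list(range(1, 10))}, cv=5)
search.fit(x, y)
print("degree selected from the data:", search.best_params_["polynomialfeatures__degree"])
```

With a pre-specified cubic model the same 60 observations would all go towards estimating its four coefficients; here some of their information is spent on deciding what the model is in the first place.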

  • 1
    +1: I think this is the best answer and I think James, Witten, Hastie, and Tibshirani would agree as they write "But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating $f$ to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for $f$". Admittedly, there are parametric machine learning methods and non-parametric statistical ones but I think this is the heart of the matter when it comes to "why more data?" – ColorStatistics Apr 30 '22 at 15:02
  • @ColorStatistics I notice now that I wrote "There are a lot of parameters to be estimated." and that is the exact same thing that I criticised in Mayou's answer. It is true that machine learning comes with extra parameters but I wouldn't say that machine learning is the exact same as a non-parametric approach. To me it is more in how the modelling is approached and whether we have an a-priori fixed model or whether we have an algorithm that generates the model based on data. – Sextus Empiricus Apr 30 '22 at 15:33
  • We're in agreement here. – ColorStatistics Apr 30 '22 at 15:39
  • @ColorStatistics Can we say that inductive bias also reduces the need for a large amount of data? Because even if we compare two parametric approaches, e.g. a CNN and an MLP, the CNN can perform as well as or better than the MLP for image classification with less data (but the same or more parameters). So the domain knowledge we have for a given problem (how to structure the architecture) matters even in a parametric vs. parametric scenario? – ado sar Jul 24 '23 at 18:38
3

A typical machine learning model contains thousands to millions of parameters, while statistical modelling is typically limited to a handful of parameters.

As a rule of thumb, the minimum number of samples you need is proportional to the number of parameters you want to estimate. So for statistical modelling of a handful of parameters you might only need a hundred samples, while for machine learning with millions of parameters you may need millions of samples.
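
To make that scaling concrete (the constant below is only a common regression heuristic of very roughly 10–20 observations per estimated parameter, not a law):

$$
n_{\min} \;\approx\; c \cdot p, \qquad c \approx 10\text{–}20,
$$

so a model with $p = 5$ parameters might get by with on the order of $10^2$ samples, while one with $p = 10^6$ parameters would, by the same crude scaling, want on the order of $10^7$.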

  • 5
    I don't think the first paragraph is true. ML models do not necessarily have large numbers of parameters (sparse modelling used to be a common topic in ML) and large non-parametric models are used in both ML and statistics. – Dikran Marsupial Apr 21 '22 at 11:25
  • 2
    @DikranMarsupial We can distinguish between necessarily and typically. I agree with you that ML models do not necessarily have large numbers of parameters. The models I train tend to have $\sim$1-10$^{7}$ parameters, but I don't know what it typically true across ML projects. – Galen Apr 21 '22 at 18:57
  • 1
    @AgnesianOperator sure. I quite like Gaussian Processes and they don't actually have parameters as such, just hyper-parameters, and inference is mostly about that small handful of hyper-parameters. So you can have complex ML models that have fewer parameters (i.e. none, plus a few hyper-parameters) than even simple statistical models. However, I suspect that outside those using deep learning toolboxes, the size of models decreases very quickly. – Dikran Marsupial Apr 21 '22 at 19:05
  • @DikranMarsupial Interesting, I had never thought about Gaussian processes as not having parameters. I don't have rigorous definition of hyperparameter in mind. Can you tell how you would define it here? – Galen Apr 21 '22 at 20:14
  • 2
    @AgnesianOperator I think GPs are a bit unusual: as they don't really have parameters, the term "hyper-parameter" doesn't really make much sense. I think it is more that the covariance parameters of a GP are analogous to the kernel and regularisation hyper-parameters of a kernel method (which do have parameters), so they are called hyper-parameters for that reason. FWIW my view of hyperparameters is given as an answer to this question https://stats.stackexchange.com/questions/149098/what-do-we-mean-by-hyperparameters – Dikran Marsupial Apr 21 '22 at 21:45
-1

Machine learning and statistical inference deal with different types of problems and are not comparable from this point of view.

Statistical inference is used in problems that are inherently statistical. For example, if it has rained for ten days, then (using a Bayesian approach) the next day is more likely to be rainy as well; no need for more data.
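
One way to make the rain example concrete (under strong assumptions the answer does not state and a comment below disputes: exchangeable days and a uniform Beta(1,1) prior on the daily rain probability) is Laplace's rule of succession:

$$
P(\text{rain on day } 11 \mid 10 \text{ rainy days}) \;=\; \frac{k + 1}{n + 2} \;=\; \frac{10 + 1}{10 + 2} \;=\; \frac{11}{12} \approx 0.92,
$$

computed from the 10 observations alone, with no separate training phase.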

But in machine learning, some features or patterns that exist in the data must be learned. For example, in classification with machine learning, the model first must learn (given a lot of / enough / balanced training data) to distinguish between pictures of cats and dogs; then, after the learning phase, the task in the inference phase is that we show it a picture and it should tell us whether it is a cat or a dog. Now suppose that we show it 10 pictures of cats to infer and classify, and all are classified successfully. Does the probability of the 11th picture being a cat matter to that machine? No, because it should classify that picture based on its learned ability to recognise a cat, not on the probability of a cat appearing after 10 cats.

  • 1
    I'm afraid you got the inference wrong in both examples. This specific issue in ML is known as class imbalance and it absolutely still happens, but it happens in statistics too. There is no good reason even to infer more rain in your example, and it's certainly not a property separating the two approaches. – Lodinn Apr 23 '22 at 19:13